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CHAPTER 1 


Introduction 

The approximate solution of partial differential equations often leads 
to large sparse systems of linear equations that must be solved numeri- 
cally. These systems can contain tens, or even hundreds of thousands 
of equations and require hours to solve on conventional mainframe com- 
puters such as the CDC CYBER 7600 series. 

With the advent of parallel architectures found in vector computers 
such as the CRAY-1 and CYBER 203/205 or arrays of microprocessors 

found in the ICL DAP <Cryer[1981]), the HEP (Smith[1978]) and NASA 
Langley's Finite Element Machine (Jordan[1978]), it may be possible to 
solve these problems in a shorter time. Also, with the cost of hardware 
continually decreasing, arrays of microprocessors may prove to be a 
cost-effective architecture for solving these large problems, especially with 
the use of VLSI (Very Large Scale Integration) or WSI (Wafer Scale 
Integration) technology. 

A major use of these equations Is In structural engineering. Prob- 

lems such as deflection of membranes are described by second order 
elliptic partial differential equations, and beam or plate bending problems 
are governed by fourth order elliptic equations. The usual way to solve 
these problems approximately is to first discretize the spatial domain by 
finite elements and then to solve the resulting system of linear equations 
by a direct solution technique, usually some variant of Cholesky decom- 
position (see e.g. Noor and Fulton[1974] and Reld[1980]) The linear sys- 
tem is often too large to fit completely in the computer's main memory. 
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especially after the fill-in due to the decomposition. Hence, these solu- 
tion techniques must include the moving of data between main memory 
and the backing store. This data handling requires efficient memory 
management and can be very time consuming. 

In this thesis, we investigate iterative algorithms for solving, on 
parallel computers, the large sparse symmetric and positive definite linear 
systems that arise from elliptic partial differential equations such as those 
from structural engineering. Iterative methods have the advantage that 
minimal storage space is required for implementation since no fill— in of 
the zero positions of the coefficient matrix for the system of linear equa- 
tions occurs during computation. Hence if many processors with 
memories are connected together and the data is distributed among 
them, it may be possible to solve large problems without moving large 
amounts of data Another advantage of an iterative method is that the 
process may converge in very few steps if a good initial guess is 
known This is the case In some applications Also, for certain three 
dimensional elliptic problems. Fix and Larsen[1971] show that iterative 
methods can outperform Cholesky decomposition on sequential computers 
Iterative methods seemingly parallelize better than direct methods and are 
therefore potentially viable techniques for solving large sparse linear sys- 
tems on parallel computers 

The thesis consists of eight chapters. Chapter 2 reviews the finite 

element method, points out aspects of it that are amenable to parallel 
computation and derives the system of linear equations for two example 
problems 
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Chapter 3 describes in detail the architecture of NASA Langley's 

Finite Element Machine. This machine is used to describe the imple- 

mentation of the parallel iterative algorithms. 

In Chapter 4. two algorithms are developed for the parallel assembly 
of the system of linear equations by the finite element technique for the 
Finite Element Machine. These assembly algorithms are then compared 
and their speedups relative to a single processor version are determined. 
The last section of this chapter describes how to perform a stress 
analysis in parallel on the Finite Element Machine once the solution to 
the linear system for the displacements is found. 

Chapter 5 describes several parallel linear stationary iterative 
methods that can be implemented on either vector computers or parallel 
arrays. The implementation of Jacobi's method is given in Section 5.1. 

Section 5.2 describes a new method, which we call Multi-color SOR. 

discusses its implementation on parallel machines, compares it to existing 
theory, and reports numerical comparisons to SOR without Multi-coloring. 
Section 5 3 describes a Multi-color SSOR method and its efficient imple- 
mentation on parallel architectures. Finally. Section 5 4 describes how to 
implement block iterative methods such as block Jacobi and block SOR 
on these machines. 

Chapter 6 describes parallel conjugate gradient methods. The 
implementation on the Finite Element Machine of the the standard conju- 
gate gradient method is given in Section 6.1. Section 6.2 1 describes 
the Implementation considerations for parallel preconditioned conjugate 
gradient methods and Section 6.2 2 lists some common preconditioners 
and discusses the difficulty encountered in their Implementation on paral- 
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lei machines. In Sections 6 2 3 and 6 2.4 we give preconditioners that 

are suitable for parallel machines, analyze when they can be applied, 
and relate them to the preconditioners of Dubois. Qreenbaum. and 
Rodrlquetl979] and Johnson[1981]. Section 6 2 5 gives numerical results 
for the preconditioners of Sections 6.2.3 and 6.2.4 on two example prob- 
lems. 

In Chapter 7 we develop a detailed model for comparing parallel 
algorithms on an architecture like the Finite Element Machine. This 
model is then used to analyze the algorithms in Chapters 5 and 6 as a 
function of the number of processors in the microprocessor array and 
also as a function of the machine's ratio of communication to arithmetic 
time. 

Chapter 8 summarizes the results of this work and describes areas 
for promising future research. 


CHAPTER 2 


The Finite Element Method 


This chapter gives a brief description of the finite element method, 
highlights the aspects of the method that are amenable to parallel com- 
putation. and derives the finite element equations for two specific prob- 
lems that are used In future chapters. 


2.1. Description of the Method 


The finite element method Is a general technique for constructing 
approximate solutions to boundary value problems (Oden[19811. Strang and 
Fix[1973]). Suppose we want to solve the following second order nonho- 
mogeneous elliptic partial differential equation with homogeneous Dirichlet 
boundary conditions' 


c 5 2 

- r (x)-r— i/ (x)) = f <x ) x efl 

/.;= i 2 3 */ " a ',- 

u(x) = o x can 


( 2 . 1 ) 


2 

where fl is a bounded domain in R and the matrix a^(x) is symmetric 
and uniformly positive definite. 


To approximate the solution of (2.1) by the finite element method, 
we first write (2.1) in its variational form: 


Find u(x)€H Q (n) such that 


a(u.v) = If .v) 


v eH (fl) 
o 


where 


( 2 . 2 ) 


5 
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and 


(w ,v) 


Jw (x)v(x)dx 
n 


a(u.v) = f E a..(x)-£-u (x) -^-vCx) dx 
fl /./*1 11 dx i dx l 

Here H 1 denotes the set of all functions veH 1 (fi) such that v=0 on an 
o 

where (n) is the set of all functions which together with their partial 

2 

distributional derivatives belong to L (TD. the space of square mtegrable 
functions on n. 


If we choose the matrix a ^ In (2.1) to be the identity matrix, we 
get Poisson's equation. 


-(u^ + u yy ) = / (x ,y) (x.y)€H 

uCx.y) = 0 (x,y)€3n 


(2.3) 


Likewise, the associated weak form of Poisson's equation can be 
obtained from (2.2). 


Find u(x.y)eH] (n) such that 
' o 

|Vu (x.y) • VvOr ,y)dxdy = |f (x .y)v (x .y)dxdy 

n n 


for all 


v(x .y)eH 


1 

o 


(fl) 


where Vw is the gradient of w. 


(2 4) 


We now consider the approximation of the solution of (2.4) by the 

finite element method. A finite dimensional subspace V h <zH° (n) must 

first be chosen. This subspace typically consists of piecewise polynomi- 

als defined over a triangulation of n, called f/ 7 . where each triangle is 
called a finite element If the subspace v/ 7 is spanned by the functions 
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Pj /= 1.2..../7. called basis functions, we look for an approximate solution 
u h (x.y ) of the form 


h 

u 


C x.y) 


n 

E a^.Or.y) 

/=! 1 1 


The substitution of (2.6) into (2.4) yields 

f Vi/ 1 (x ,y) • (x .y)dxdy = f f (x ,y)v /J (x ,y)dxdy 
lh Jb 

n n 

for all v h eV^ . 


( 2 . 6 ) 


(2.7) 


By choosing v h =-p f . 1=1 ,2,...n in (2.7), we get the following symmetric 
and positive definite system of linear equations for SL- 

Ksx. = L (2.8) 


where 



(2.9) 


and 


f = J f (x.yJ^Cr.y) dxcfy 

n" 

The power of the finite element method lies in the choice of the basis 
functions Pj. A linear polynomial is uniquely determined by its values at 
the three vertices of any triangle. Suppose Is chosen to be that 

piecewise linear polynomial which has the value 1 at vertex V / of the 
triangulation and has the value zero at the rest of the vertices of the 
triangulatlon. Then Pj is a continuous function which belongs to (fl) 
and is nonzero only on those triangles having V as a common vertex. 
In addition, any C° -piecewise linear polynomial may be represented as a 
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linear combination of the -P/s. Functions belonging to (f/ 7 ) are 
obtained by omitting those that are defined to have the value 1 at 

the Vj's on the boundary of f/\ The points of for which P t is 

defined to be either 1 or 0 are often referred to as nodes. If the P t 's 
are linear as described above, the vertices V / will be the nodes. 

Alternatively, a quadratic polynomial is uniquely determined by its 
values at the vertices and at the midpoints of the sides of the triangle. 

Likewise, values given at ten nodes located two per side of the triangle, 

one at each vertex, and one in the center of the triangle uniquely 

determine a cubic polynomial. 

By choosing the Pj's to be piecewise polynomials with the value of 

either 1 or 0 at the points of the triangulation, the value of a. in equa- 
tion (2.6) will be the value of u h (x.y) at the nodal point /. denoted 6j. 

and we can write equation (2.8) as 

K£l = L (2 10 ) 

The matrix K will be sparse due to the choice of the basis functions 
since the values of k ^ will be zero if nodes / and / are not on a 

common finite element. In other words, row / will have at most as 
many nonzero off diagonal entries as node / has neighbor nodes (two 
nodes are called neighbor nodes if they share a common finite element). 
To Illustrate this. Figure 1 shows a region discretized by triangular finite 


elements. 



Figure 1 . Region Discretization 
18 elements; 16 nodes 


Now. if linear piecewise polynomials are chosen for the basis functions, 
row 6 of K will have at most 6 off-diagonal entries; namely, k 62 , k Q5 . 
* 65 , * 67 , /Cg g . and fcg 1Q . This sparsity will be a major consideration 

in the design of parallel algorithms for solving K£=i. and will be 
addressed in Chapters 5 and 6. 

Each entry, k^. as defined by equation (2 9) is obtained by an 
integration over the domain n h . Since -Pj and Pj in (2.9) are both 
nonzero only on finite elements that contain both nodes i and /. this 
Integration is only performed over these particular elements. Figure 1 
shows that the integration is performed over six elements to calculate k^ 
if /=/ and two elements otherwise. As an example, suppose /= 6 and /= 7 
Then, from Figure 1 we obtain 

*67 = 7 X^ 6 "" 7 «•"> 

Likewise. If /= 6 and /= 6 we get 

' l 66 = X V '’ 6 ' 7 < ’ 6 t X 7< ’ 8 '^ I 7 '’ 6 ' 


( 2 . 12 ) 
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Lastly, if 1=6 and / =8 then 

*68 = 0 (2 13) 
since nodes 6 and 8 have no finite element in common. These obser- 
vations suggest a commonly used three step procedure for assembling 
the K matrix' 

(1) Zero out the storage that is to be used for K This will usually 
be a symmetric storage structure in which the diagonal and upper 
bands of K are stored. 


(2) Integrate over each element, one at a time, to calculate the 
element's contribution to the diagonal and upper bands of K 
For example, the integrations over element (1) in Figure 1 yield 
the following contributions to K: 


n 


r v -p- 1 • 

i) 

*12' 

J Vp • Vp 
(1) 

CM 

CVJ 


*15 

r • ?p 

(i) 

55 • 


r 

!) 5 5 

*25 

f vp 'Vp 
(D 


These values comprise what is commonly called the element 


matrix for element (1) and can be represented by 

1 2 5 



(3) The values in each element matrix are added to the appropriate 
position in the global K matrix 


This procedure can be adapted for a parallel computer with little 
modification since the Integrations over two different elements can be 
done simultaneously. This is the topic of Chapter 4. 

Thus far only the solution of a scalar partial differential equation by 
the finite element method has been considered; however, the method can 
be applied to a system of equations as well, in particular, the equa- 
tions that govern the static displacement of a body in plane stress will 

be a coupled sytem of two equations for the displacements of the body 

in the x and y directions respectively. The finite element method as 

applied to this problem will be described in the following section 

2.2. Plane Stress Equations 

The procedure for constructing the stiffness matrix K for a plane 
stress analysis of an isotropic linear elastic body n will be described in 
this section Similar descriptions may be found in Oden(1981]. Nome 
and deVriesn978). and Zlenkiewlczn971]. The problem is to find the 
displacements in both the x and y coordinate directions of a 2- 
dimensional body n that is m plane stress, such as the membrane 
shown in Figure 2. 
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No=o. 

Figure 2. Plate in Plane Stress 


First, we introduce the notation that will be used in this discussion 



stress vector 


strain vector 


displacement vector 



E=Young's modulus 
v=Poisson's ratio 
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N 


L 




n .n normal components 3fl p 

x y c. 


body forces per unit area 


Q. 



A 

u. 



surface tractions applied on 3 n 2 


displacements on 3f2^ 


The conservation of linear momentum for the body fi states that any 
portion w of the body must be in static equilibrium: 

J(D ^a+Ddxdy = 3. (2.14) 

(J 

For sufficiently smooth a and i_ we get the partial differential equations 

of equilibrium for the body n. 

D T a + L = a (2 15) 

A material that is linearly elastic, homogenous, and isotropic satisfies the 
constitutive equation which relates the stresses to the strains 

O. = E£. (2 16) 

The strains and displacements are related by 

£ = 0 si (2 17) 

The substitution of (2.16) and (2 17) into (2 15) yields the partial dif- 
ferential equations in terms of the displacements only 

D^EDiiOr.y) +!.<*./) = 0, Oc,y)efl (218) 
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The boundary conditions for (2.18) are. 


j£(s) = u. (s) s ean. 


(2.19a) 


Nfi= £L or NEDu=a(s) s 63 fl. 


( 2 . 19 b) 


The variational form of the problem in (2.18). (2.19a). and (2.19b) is 


easily found to be 


f (C&.) ^EDydxdy = fjJidxdy t f v 7 " &ds 

n n an„ 


(2.20) 


1 1 

for all v ( aH (fl), Uj nH (fl), 1=1,2 and n-a. on 3n 1 and x=£ on an^. 

Before we Introduce the finite element approximation to (2.20) an 
alternate derivation of (2.20) will be given. This derivation comes from 
the minimization of the potential energy of the body. The potential 
energy functional X(i*> Is given by 


X(W.) = (Diy) ^EfDvt )dxdy - j&’ldxdy - J jjJ" yds 


(2 21 ) 


which is to be minimized among all 


1 i 

%.eHg= m w.j eH on an^. 


To find the value of u.eH E that minimizes X, the first variation 6(X(ll.x.» 
is formed and set to zero. 

0 (X(uiy.))=^J (Dit)^E(Dy.)dxdy (Qy.)^EDudxdy -jVidxdy 


- f vj" gds = 0 ,/or all vettl. 

an„ E . 


(2.22) 


where =| v. J VjHH^ (fl) ,v.=0 on an^ j 


The boundary conditions (2 19b) are natural boundary conditions and do 
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not need to be explicitly imposed. They will automatically be satisfied by 
the mimmizer of (2.21). Now. since E=E 7 ", (2.22) becomes (2.20). 

The solution of (2.20) is approximated by the same technique 
described in the last section for Poisson's equation, that is, we choose 
a finite dimensional subspace t/ 1 c/-/ 1 (fi) and look for the approximation 
iLix.y) of the following form: 

iL (x ,y) = <ML (2.23) 


where 4> is a 2x2n matrix of piecewise continuous basis functions and £l 
is a 2 nx 1 vector containing the unknown values of the components of 
!L Or.y) at the n nodal points. In particular for linear basis functions, if 
we just consider the value of u. h (x.y) where (x.y) is a point inside a tri- 
angular finite element (e) with nodes 1.2. and 3. (2.20) becomes 


where 


and 


h (e) 

u. 


. . _(e)«(e) 

(x.y) = 9 £l 


(2 24) 




The P ( 's in (2 24) will be piecewise linear polynominals defined on tri- 
angular finite elements as in (2 9). Note that the same basis functions 
are used for both components of it in (2 24) since both components 
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are elements of the same subspace \Z* . 

The substitution of (2 24) into (2 20) yields the following element 
matrix and vectors: 


K (e) = J (D<l» e ) 7 ED 4> d dxdy 
n e 

l_ (e) = J (<J > e ) T f.dxdy (2 25) 

n 

= J (<J> ) 7 N ads 
3n 2 

The equations that govern the solution at a particular node / are 
the global equations 


Ki = i t 2 (2.26) 

and must be assembled from (2 25) by adding the contributions over 
each element to the appropriate position in K as was described in the 
last section. 

Suppose that once the displacements & are found, the stresses a 
are to be calculated. From (2 16). (2.17). and (2.23). the stress vector 
a can be expressed as 


a = ED4>& 


(2 27) 


Since the elements of <I> are linear. D<t> will be a matrix of constants 
This implies that the stresses will be constant over a given element. 
This can be seen since on a particular element e. <I> will contain only 
three nonzero basis functions irr each row that are associated with the 
three nodes of (e) Therefore, we may also write 


(e) r-rv* (e) * (e) 

a = ED4> £l 

(e ) (e) 

where 3> is a 2x6 matrix and j5l is a 6x1 vector 


(2 28) 
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(q ) (a ) 

Hence, once A is found the solution of si only requires one matrix 

(e ) 

multiplication. In fact, this stress matrix ED<I> can be saved from the 
calculation of K (e) in (2.25). 

(e ) 

As illustrated above, the calculation of a requires only the values 

(e ) (e ) 

of A and ED® . This suggests a parallel implementation of the 
stress calculation by elements is possible. A more detailed explanation 
of this is given in Chapter 4 for linear basis functions defined on a tri- 
angulation of a plate under plane stress 



CHAPTER 3 


The Finite Element Machine 

This chapter gives a brief summary of parallel architectures that 
have been built or are under development and then describes in detail 
the architecture of the Finite Element Machine. 

3.1. Review of Parallel Architectures 

For our purposes, it is convenient to classify parallel architectures 
into two categories, namely, vector computers and array computers. A 
vector computer will be considered to be a computer that has special 
hardware instructions which accept vectors as operands. These instruc- 
tions may be implemented via hardware pipelines as was the case for 
the TI-ASC and the CDC STAR 100 in the early 1970's and currently for 
the CDC CYBER 203 and 205 and the CRAY-1 Alternatively, the ele- 
ments of the vector may be loaded into separate processing elements as 
is the case for the ILLIAC-IV (The iLLIAC-IV may also be considered to 
be an array computer). A description of the architecture of these 
machines as well as the state of the art in 1977 of algorithms for solv- 
ing partial differential equations on them is given in Ortega and 
Voigtfl9771. 

An array computer consists of an array of processing elements each 
of which may execute instructions. These elements may be simple chips 
to perform specific functions or they may be complete processors 

Array computers fall into two classes depending on the manner in 
which they execute instructions (Fiynnfl 9761) In the first class, the pro- 
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cessing elements either all execute the same instruction or no instruction 
on different data. This class is called SIMD (single Instruction, multiple 

data). The ILLIAC IV is an example of a SIMD machine Alternatively, 
the processors may execute different instructions in an asynchronous 
fashion on different data. This class is called MIMD (multiple instruction, 
multiple data). 

During the course of a computation these processing elements must 
communicate with each other. Since providing a communication link 

from a processor to every other processor becomes prohibitive as the 
number of processors increase, a particular connection strategy also 
influences how we will further classify array architectures. Three major 
strategies as summarized by Ortega and Voigt[1983] are listed below 

(1) P processors are arranged in the form of a regular lattice. 

Each processor has its own local memory and is permanently 
connected to a small subset of other processors, called neigh- 
bors. 

(2) P processors with local memory are connected to each other by 

a bus. 

(3) P processors and M memories are connected by an an elec- 

tronic switch so that every processor has access to a subset, 
possibly all, of the memories. 

Lattice arrays include the ILLIAC-IV. the Distributed Array Processor 
(DAP) of International Computers Limited (Cryer(1981]). and the Systolic 

Arrays of Kung{1980] The ILLIAC-IV and the DAP are SIMD machines 
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The Systolic Arrays are special purpose computers on a chip with a 
group of processing elements working in a SIMD fashion to produce out- 
put for other groups of processing elements. Hence, the idea is to 
have simple and cheap processors that calculate and transmit data in a 
regular fashion for a particular application. 

Examples of two MIMD bus arrays are the CM* at Carnegie Mellon 
University (Swan[l 977]) and ZMOB under development at the University of 
Maryland (Rieger.Trlgg. and Banen981]>. 

Examples of two MIMD switched arrays are the C.mmp of Carnegie 
Mellon University (Fuller and Harbisonn9781) and the Hetrogeneous Ele- 
ment Processor (HEP) being implemented by Denelcor. Inc. 
(SmithH978]). 

3.2. The Finite Element Machine Architecture 

Of particular Interest to us is the architecture of the Finite Element 
Machine (FEM) at NASA Langley Research Center. The FEM is an array 
of microprocessors that can operate asynchronously, and can be classi- 
fied as an MIMD computer that is arranged in a square lattice confi- 
guration with dedicated local communication links between any processor 
and its eight nearest neighbors. However, it is not strictly a lattice type 
array since a bus connects all the processors. 

To summarize the motivation for FEM. consider the structure in Fig- 


ure 1. 
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Figure 1. Example structure 


This structure is composed of simple rod elements of two nodes each. 
Two nodes are said to be connected if they are on the same rod. For 
example, node 1 Is connected to nodes 2. 5, and 10 As was seen in 
Chapter 2, the finite element method produces a block stiffness matrix 
that has as many nonzero off-diagonal blocks in row i as node i has 
connected nodes. If an iterative method is used to solve the equations, 
the solution at node l is only a function of the information from node i's 
connected nodes. This suggests assigning a processor to each node and 
providing it with local communication links to processors containing con- 
nected nodes. For FEM. the idea was to provide a fixed number of 

local links per processor and to provide a global bus to handle the 
connectivities that are not satisfied locally. For example, if eight nearest 
neighbor links are available, processor 7 can operate with local links 
only, but processor 1 must communicate globally with processor 10 

A detailed description of the machine is given in the following para- 
graphs. Most of this material can be found in Jordan, et alfl979], Jor- 
dan[1978], and [1979], Jordan and Sawyer[1978], and Podsiadlo and Jor- 
dan(1981J. Additional hardware information was obtained through personal 
discussions with Tom Crockett, Judson Knott, and David Loendorf at 
NASA A current status report on the hardware, system software, and 
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application software can be found in Storsaali. Peebles, Crockett, Knott, 
and Adams[19821. 

The FEM architecture as shown in Figures 2 and 3, consists of a 
controller and individual microprocessors, which we will call nodal pro- 
cessors. The controller is connected to the nodal processors via a glo- 
bal bus. Each nodal processor of the FEM is connected to its eight 
nearest neighbors via the two-way local communication links. These 
local links also connect the edges of the FEM in a toroidal wrap around 
fashion so that every processor has eight nearest neighbors Each pro- 
cessor can also communicate globally to its non-nearest neighbor pro- 
cessors by using the serial global bus. Each microprocessor has its 

own operating system and memory. Once operation begins, each pro- 
cessor executes its program independent of the controller. This archi- 

tecture can be classified as an MIMD (multiple instruction, multiple data) 
machine with no central shared memory. 

3.2.1. Controller Hardware 

The controller is a Texas Instruments 990/10 mini-computer. It has 
128K words of memory and four 5-megabtye disk drives. A printer is the 
output device for printing files from the disk of the Tl 990 That is, 
output from the nodal processors is transferred over the global bus and 
placed on a file on the 990 disk, for printing when needed. 

The DX-10 full screen editor on the 990 is used by the applications 
programmer to edit, compile, and link the programs that are to be run 
on FEM The necessary data files for each processor can be created 
by the editor and stored on the 990 disk Alternatively, for complicated 
problem geometry, a program could be written to split a global data file 
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into appropriate data files for each processor. Nevertheless, these data 
files and the linked program file are stored on the 990 disk until time 
for downloading to FEM. This downloading is accomplished by executing 
commands on the controller 

3.2.2. Nodal Processor Hardware 

The nodal processors are comprised of three hardware boards each, 
namely, the CPU board, the 10-1 board, and the 10-2 board. These 
components will be described in the following sections. 

3.2.2.I. CPU Board 

The CPU board contains a Tl 9900 16 bit microprocessor. 16K bytes 
of read only memory (ROM). 32K bytes of random access memory (RAM), 
and an Advanced Micro Devices AM9512 floating point chip. An illustra- 
tion of a nodal processor is given in Figure 4. 

The ROM is reserved for NODAL EXEC, the nodal executive operating 
system, and PASLIB, the PASCAL subroutine library that contains various 
basic routines such as SEND and RECEIVE for transmitting data. 
Approximately 4K of RAM is also reserved for NODAL EXEC. The 
remaining 28K RAM is available for program code, run time data struc- 
tures. and special user allocated data storage, called data areas. The 
user may request as many as 32 separate data areas of convenient 
sizes for a particular application program These areas will be used to 
store the problem data that is downloaded from the 990 disk. In prac- 
tice. the program code Is loaded at address 8000 hexadecimal toward 
the bottom of the stack and the data areas are loaded directly above 
the NODAL EXEC towards the top of the stack as depicted below. 




MICROPROCESSOR 

MODEL 






FIGURE 2, THE FINITE ELEMENT MACHINE 




FIGURE 3, THE FINITE ELEMENT MACHINE ARCHITECTURE 
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FIGURE 4. THE FEM NODAL PROCESSOR 
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address FFFE (hex.) 


The AM9512 floating point chip with a clock frequency of 2 MHz 
provides single precision (32-bit. 25 bit mantissa) and double precision 
(64-bit, 57 bit mantissa) add. subtract, multiply and divide operations. To 
use this capability, the operands must be loaded by software which 
requires approximately 358 microseconds for two single precision 

numbers. Once the operands are loaded, a single precision floating 
point add or subtract can be performed in approximately 29 
microseconds, a single precision floating point multiply in approximately 
99 microseconds, and a single precision floating point divide in approxi- 
mately 114 microseconds. These times for loading the operands were 
provided by Tom Crockett at NASA Langley Research Center and the 

arithmetic times can be found in the Advanced Micro Devices 
Manual[1979], 

3.2.2.2. 10-1 Board 

The hardware circuits for the local communication links and the 

sum/max network are on the 10-1 board. As stated earlier, each pro- 

cessor is connected locally to its eight nearest neighbors Each proces- 
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sor has 12 8-bit by 32-bit FIFO hardware queues for receiving values 
from its neighbors. Likewise, an output register is available for sending 
values to neighbors. Software queues for synchronous and asynchronous 
data receiving are also implemented. An Illustration of the local com- 
munication links is given in Figure 5. 

A separate hardware circuitry was designed by Jordan for calculating 
maximums and sums across processors (Jordan. Scalabrim. and Cal- 
vert(19791). The sum/max hardware can be envisioned as a binary tree 
with each processor initially at the leaves of the tree. The values from 
pairs of processors are added and passed to the next level in the tree. 
The procedure is repeated until the final result is obtained. This allows a 
sum to be calculated in log 2 p time where p is the total number of pro- 
cessors. An algorithm for finding the maximum value when only local 
links and a global bus are available can be found in Bokhari[1979]. 

3.2.2.3. 10-2 Board 

The hardware for the global bus connections, a signal flag network, 
to be described below, and the processor's self identification tag is 
resident on the 10-2 board. 

In addition to serving as a connection between the controller and 
FEM. the global bus connects each processor to every other processor. 
The bus is 16 bits wide; therefore, one single precision number requires 
two transmissions. Information to be sent on the bus is tagged with a 
source, destination, and mode tag. The mode indicates whether the data 
is to be broadcast to all processors or if it is to go to the destination 
processor. Since contention Is likely to occur for the bus, the outgoing 
data is buffered Input from the bus is detected by the address detector 


25a 


FROM NEIGHBOR N FROM NEIGHBOR NE FROM NEIGHBOR NW 



TO NEIGHBOR N TO NEIGHBOR NE TO NEIGHBOR NW 


FIGURE 5. THE FEM LOCAL COMMUNICATION LINKS 
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and queued in an imput buffer. An overview of the global bus is given 
in Figure 6. 

Each processor is part of eight separate signal flag networks, flags 
0 through 7. which can be used for synchronization or decision making. 
Each flag can be enabled Into or disabled from its network by a PASLIB 
routine. Data area 0 contains a list of processors, both local and glo- 
bal, among which information must be shared during program execution. 
These processors will be called logically connected processors. The 
major function of the flags is to answer the following questions. 


Any(k)? Zs Flag k set in any enabled logically 
connected processor? 

All(k)? Is Flag k set in all enabled logically 
connected processors? 

Sync(k)? Was All(k) true previously? 


Flag 0 has a FIRST bit. This is used to determine if this processor 
was the first one to set the flag. An illustration of the signal flag net- 
work Is given in Figure 7 

The processor's self identification number Is hardwired on the 10-2 
board. A PASLIB routine is used to return the value of the self-id as 
needed during program execution. Typical instances in which the self-id 
is necessary are decision making during computation and mterprocessor 


communication. 
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Presently at NASA. 4 processors are operational with 16 scheduled 
(or the immediate future and 36 scheduled (or December 1982. The ini- 
tial design was for 1024 processors configured in a 32 by 32 array. 



I 



FIGURE 6, THE FEM GLOBAL BUS 
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FIGURE 7. THE FEM SIGNAL FLAG NETWORK 































CHAPTER 4 


Parallel Matrix Assembly and Stress Calculation 


4.1. Parallel Matrix Assembly 

This section describes how to assemble in parallel the nonzero 
coefficients of the stiffness matrix K. as described In Chapter 2, on an 
array computer like the Finite Element Machine A description and 
analysis of the assembly process with and without communication between 
processors will be given. 


Figure 1 shows a region that is discretized by eight triangular finite 


elements which are comprised of three nodes each. 


p 

( 0 , 2 ) 


Q 

( 2 , 2 ) 



If there are d unknowns at each of the nine nodes, the resulting stiff- 
ness matrix K will have dimension 9dx9d. These nine nodes (and asso- 
ciated data) are partitioned to the three processors (P.Q.R) so that dur- 
ing the solution of Kit=L each processor will calculate exactly 3d unk- 
nowns. 

The coefficients of K that are required by processor P for the solu- 
tion of the unknowns at nodes 4,5. and 7 must either be calculated by 
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P or calculated by processors R or Q and communicated to P before 
the displacement calculation begins. In both cases, storage must be 
allocated in P's memory for these coefficients. The amount of storage 
depends on the number of nodes assigned to P. the number of equa- 
tions at each node, and the number of nodes that share a common fin- 
ite element with P's nodes. In particular, for d equations at each node, 
a dxd coefficient matrix must be calculated for every pair of nodes on 
the same finite element if at least one of these nodes is assigned to 
processor P. In addition, one dxd symmetric matrix must be found for 
each node that is assigned to P. The matrices that must be stored in 
P's memory for the region in Figure 1 are indicated in Table 1. The 
nodes (4,5,7) are labeled Interior and the nodes in other processors that 
share a common finite element with any of these nodes are labelled 
Exterior and [x] represents a dxd matrix. 

Interior Exterior 



4 

5 

7 

12 3 

6 8 

4 

[x] 

[x] 

[X] 

[X] [x] 


5 

[x] 

[x] 

[X] 

[X] [X] 

[X] [x] 

7 

[X] 

[x] 

[X] 


[x] 


Table 1. K Matrix Coefficients for Processor P 

For an iterative solution of Ku.-t. Table 1 contains all the coefficients 
needed for processor P to solve for the displacements at nodes 4. 5, 
and 7. If a direct method such as Cholesky factorization were used 
Instead, extra storage must be allocated for the coefficients that will ‘fill 
in* the band of the upper triangular factor of K during the decomposi- 
tion process For Figure 1. space must be reserved in processor P for 
the 4-3. 4-6, 7-6, and 7-9 dxd fill in matrices. In particular, space 



30 


must be allocated in Table 1. for the fill in coefficients 4-3. 4-6. and 
7-6 

We now describe how to assemble the coefficients in Table 1 by 

considering node 5. To find the tx] coefficients in the second row and 
second column of Table 1 (denoted by 5-5). the coordinates of nodes 
2.3,6. and 8 in addition to those of nodes 4 and 7 must be available to 
processor P since node 5 is on elements El, E2. E3. E4, E5. and E6 
as shown in Figure 1. Hence, the 5-5 Cxi coefficients can not be found 
without coordinate Information that resides in other processors In fact, 
the same conclusion holds for every [x] in Table 1. This observation 

leads us to consider the following strategy: 

(1) Load each processor's memory at the outset with all problem data 
necessary for the calculation of the coefficients that are required 
in the solution of the displacement equations at its collection of 
nodes. 

(2) Implement one of the following two policies. 

Policy 2 Each processor will calculate the upper triangular and 
diagonal coefficients associated with its collection of nodes as 

well as the coefficients associated with the connection between 
the processor's interior and exterior nodes. Communication 
between processors will not be required for this policy. 

Policy 2. Each processor will calculate the upper triangular and 
diagonal coefficients associated with its collection of nodes The 
coefficients that are associated with the connection between the 

processor's interior and exterior nodes will be sent and received 
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between processors. In particular, the lower triangular coefficients 
must be received and the upper triangular ones sent This com- 
munication can be done on the Finite Element Machine via the 
local neighbor links or the global bus. 

In either case, all the coefficients that are necessary for the dis- 
placement calculation will be stored In the memory of the proces- 
sor. 

Both strategies can be implemented by providing each processor with 
a table of elements and their associated properties (type, thickness, etc.), 
coordinates of interior and exterior nodes, and associated processors for 
the exterior nodes for use in data communication. Typical problem data 
for processor P Is given In Table 2. 

Global Node Node 

Elements Number Coordinates Processor 



8 

7 

5 

4 

(0,1) 


_ 


5 

s 

8 

5 

(1,1) 
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6 

5 

3 

7 
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3 

5 

1 
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5 
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2 

2 
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R 


4 

5 

7 

3 

(2,0) 


R 


1 

2 

4 

6 

(2,1) 


Q 





8 

(1,2) 


Q 



Table 2. 

Problem 

Data for Processor P 


During 

the 

assembly 

process. 

integrations are 

performed over the ele- 

ments 

(one 

at a 

time) 

in a processor's element 

table 

and the resulting 
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contributions are added to the appropriate global coefficients For exam- 
ple. integrations over element 4-5-7 (E6) yields contributions to the 
coefficients shown below. 

4 5 7 

4 

5 

7 

Now consider the element 5-4-2 (E5). Processor P calculates the 

following contributions for the cases of commmumcation (Policy 2) and 
no communication (Policy 1) between processors respectively while pro- 
cessor R calculates the same contributions regardless of the policy. 



2 

4 5 


2 4 

5 


2 4 5 

2 



2 



2 

‘[X] [x] [xf 

4 


[X] [X] 

4 

[X] [X] 

[X] 

4 


5 


[x] 

5 

_[X] 

[X]_ 

5 



Processor P Processor P Processor R 

( Policy 1 ) ( Policy 2 ) 


This is possible because processor R also has element 5-4-2 in its 
element table For Policy 2. the 4-2 and 5-2 contributions must be 
sent by processor R to processor P and received by processor P. 
whereas for Policy 1. the 4-2 (or 2-4) and the 5-2 (or 2-5) contribu- 
tions are calculated by both processors P and R thereby resulting in a 
duplication of effort Similar arguments hold for the 5-5 contributions 
from the integrations over elements E1.E2.E3. and E4 In Figure 1. The 
amount of communication overhead and effort duplication are analyzed in 


[x] [x] [x] 

[x] [x] 

[x] 


section 4 2 
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The output of the assembly process will be the data structures 
KCOEFF and CONNECTED_TO which describe the K matrix coefficients 

necessary for the calculation of displacements at processor P's nodes, 
and the data structure SEND_TO which describes the processors to which 
values from P must be sent during displacement calculation. These 
structures are illustrated below for the region of Figure 1 

Node KCOEFF Node CONNECTED .TQ 

4 [4] [5] [7] [1] [2] 45712 

5 [5] [4] [7] [2] [3] [6] [8] 5 472368 

7 [7] [4] [5] [8] 7458 

Node SEND_ TO 

4 R 

5 R Q 

7 Q 

It is possible to use more space efficient storage structures since both 
the upper and lower nonzero parts of the symmetrix K matrix are stored. 
However, these structures were implemented to allow for ease in the 
computation required by the iterative solution algorithm for the displace- 
ments Also, if many processors are available so that a small number 
of nodes may be assigned to each processor, this extra storage will not 
be prohibitive 

A routine that uses Policy 1 and the data structures described 
above was written for a FEM of any number of processors and imple- 
mented on a 4 processor FEM for the equations of plane stress on a 
region discretized by linear triangular elements as shown in Figure 2b 
The same ideas can be used for other partial differential equations and 
finite elements as well since the influence of a particular partial 
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differential equation and finite element is contained in a subprocedure 
that sets the value of the coefficients after performing integrations over 
the element 


4.2. Speedup for Parallel Matrix Assembly 

The speedup for the matrix assembly process is the time to assem- 
ble the matrix on a uniprocessor machine divided by the time to assem- 
ble the matrix on an array with p processors. The processor efficiency 
is defined to be the speedup divided by the number of processors: 


Speedup (p) = Time C\)/Time (p) 
Efficiency ip) = Speedup (p)/p 


(4.1) 


For a rectangular domain the speedup is easily calculated and will be 
described below for the symmetric stiffness matrix that results from a 
plane stress analysis of a plate that has been discretized by linear tri- 
angular finite elements, however, the same type of analysis can be done 
for other problems and finite elements as well 

Let the plate be discretized so that N nodes are arranged in r 
rows and c columns as shown in Figure 2a 
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k 



i =3,k=3.p =4 

Figure 2b. Four Processors 




/ = 1.*=3.p = 12 

Figure 2d. Twelve Processors 


First, the time required for a uniprocessor matrix assembly will be 
derived. Since K is symmetric, only the upper triangular and the diago- 
nal part of K will be calculated. If each node in the plate represents 
d unknowns 2 for the plane stress problem), the K matrix can be 
partitioned such that each partitioned row represents the equations at a 
single node in the problem grid. Then for the discretization shown in 
Figure 2a there will be -at most 3 dxd matrices in the strictly upper tri- 
angular part of each partitioned row and the diagonal entry will be a 
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dxd symmetric matrix. These four matrices can be visualized as the 
contributions from a node's northwest, north, and east neighbor nodes as 
well as the contribution from the node itself. The connectivities of these 
neighbor nodes to the particular node C are shown in Figure 3. 



Figure 3 . Upper Triangular: Connections to Node C 

2 

Now because of symmetry only ( d id)/ 2 elements of the dxd matrix on 

the diagonal must be calculated. The off diagonal matrices on the other 

2 

hand are not necessarily symmetric and also may be full so that all d 
elements must be formed. The total number of entries that must be 
calculated for the diagonal matrices and the matrices in the upper tri- 
angular part of K are itemized in <4 2). 

2 

N (.d id)/ 2 for the diagonal matrices 

2 

(N-c)d for the north matrices 

(4.2) 

2 

(,N-r)d for the east matrices 

2 

(/V-r-c+l)d for the northwest matrices 

Hence, the total time (in units of the number of entries) needed to 

assemble the K matrix on a uniprocessor machine is 

Time (1) = (3.5A/-2r-2ctl)d 2 +0 5Nd (4.3) 

Next, the time to assemble K in parallel will be given for three dif- 
ferent array processor arrangements first, for the special case of all 
boundary processors, that is. 4 processors arranged in a 2x2 grid, 
secondly, for p >4 processors arranged in a p/2 x 2 grid, and lastly for 
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p >4 processors arranged in a Vp* x Vp" grid. The p/2 x 2 grid is 
only considered here for the purpose of analyzing the cases shown in 
Figure 2. In practice the problems of interest will be large enough to 
utilize a processor grid that contains interior processors For all three 
cases, assume N nodes are partitioned to p processors with each pro- 
cessor receiving an ixk grid of nodes as illustrated in Figures 2b, 2c, 
and 2d 

The number of upper triangular and diagonal coefficients that must 
be calculated by the lower right corner processor when processors are 
arranged in a 2x2 grid is itemized in (4.4). This lower right corner 
processor is the limiting processor in the sense that it has more data 
communication to perform. 

2 

lk(d +d)/2 for the diagonal matrices 

2 

ikd for the north matrices 

(4 4) 

2 

ikd for the northwest matrices 

2 

/0c-1)cf for the east matrices 

For Policy 1. extra time must be added to (4 4) to account for the 
redundant calculation of the west matrices for the non-mtenor nodes of 
the processor For Policy 2. these values must be received instead of 

calculated and the non-interior upper triangular values sent The 
number of duplications for Policy 1 and the number of sends and 
receives for Policy 2 are given by (4 5) 

2 

Duplication (4) = id 

Receive (4) = id 2 (4 5) 

Send (4) = (2X+/-Dd 2 

Let a and b represent the number of coefficient calculations required to 



38 


equal the time of one send and one receive operation respectively 
Then, the total time for the parallel matrix assembly is given by (4 6a) 
and (4.6b) for Policies 1 and 2 respectively. 


Time ^(4) = 3 5jkd 2 +0 5/kd 

TimegM) = [3 5/ft-/ ta (2k t/ -!)+£>/ Jd 2 +0.5 ikd 


(4.6a) 


The speedup and efficiency can now be calculated by using (4 1) 
These values are given in (4 7a) and (4 7b) for Policy 1 and 2 respec- 
tively. 


Speedup ^ (4) 
Ef f iciency^ (4) 


(3.5/V-2r-2c tPd 2 t0,5 Nd 
3.5/fcd 2 tO 5/kd 

(3.5A/-2r-2ct1)d 2 +0.5Nd 
3 5A/d 2 fO.SNd 


(4.7a) 


Speedup ^ (4) 
Ef f iciency^ (4) 


(3.5A/-2r-2ct1)d 2 tO 5 Nd 
[3 Sik +(bi -i +a (2k +/ -1))ld 2 +0 5 ik 

(3.5/V-2r-2c tl)d 2 t0.5A/d 
(3.5A/tp (b/-/ta (2/<+/-l))]d 2 t0 5 Nd 


(4 7 b) 


If p >4 processors are arranged in a p/2 x 2 grid so that each 
processor is on the grid boundary, the number of upper triangular and 
diagonal coefficients that must be calculated by the limiting processors 
on the left boundary of the processor array is itemized in (4 8) 


ik (d td )/2 
ikd 2 

I 0c-1)d 2 
ikd 2 


for the diagonal matrices 
for the north matrices 
for the northwest matrices 


(4 8) 


for the east matrices 
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For Policy 1. the redundant calculation of the south and southeast 
border matrices must be added to (4.8). This duplication and the 
receive and send communications that are necessary to implement Policy 
2 are given in (4 9). 

Duplication (p) = (2/<+/-1)d 2 

Send Ip) = (2fc+/-1)d 2 (4 9) 

Receive (p) = (2fc+/'-1)d 2 


The total time to assemble in parallel the K matrix for Policies 1 and 2 
is given in (4.10a) and (4.10b) respectively. 


T/me 1 (p) = (3.5M+2/c-1)d +0.5/*d 


(4.10a) 


Time 2 (p) = [3.5/k-/+(a+b)(2fc+/-1)]d‘+0.5//<d 


(4.10b) 


The speedup and efficiency are given by (4.11a) (4.11b) for Policy 1 and 
Policy 2 respectively. 


Speedup ^ <p) = 


Ef 1 iciency^ip) = 


(3.5A/-2r-2c + 1)d t0.5A/d 
(3.5/fc +2k~i )d 2 +0 5 ikd 

(3 5A/-2r-2c + 1)d 2 tO 5 Nd 
(3.5 N +p (2/c -1)>d 2 +0.5A/d 


(4 11a) 


Speedup 2 (p) 

Efficiency 2 (p) 


(3 5A/-2r-2c + 1)d 2 tO 5/Vd 
(3 5ik-i+ia+b)(2k+i~)))d 2 +0.5ikd 

(3.5N-2r-2c+))d 2 +0.5Nd 
{3.5 A/ -p/+p(a+PX2*+/-1)]d 2 +0 5A/d 


(4 11b) 


Lastly, assume that p >4 processors are arranged in a square 
^p" x Vp" grid so that there will be one or more processors com- 
pletely inside the processor grid For this case, the upper triangular 
and diagonal coefficients calculated by a limiting interior processor are 
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itemized m (4.12). 


(4 12) 


lk(d td)/2 for the diagonal matrices 

2 

ikd for the north matrices 

2 

Ikd for the northwest matrices 

2 

ikd for the east matrices 

For Policy 1. the duplicated calculations for this processor arrangement 
are due to the calculation in each processor of the west, south, and 
southeast matrices associated with the connection between the interior 
and exterior nodes and is given in (4 13). The sends and receives 
necessary to implement Policy 2 are also given in (4 13). 


Duplication (p) = (2A+2/'-l)d 2 
Send (p) = (2*+2/-1)d 2 
Receive (p) = (2*+2/-l)d 2 


(4.13) 


The total time to calculate the K matrix for Policy 1 and 2 is given in 
(4.14a) and (4.14b) respectively 

T/me-j (p) = (3.5/)c +2k +2/ -l)d 2 +0.5/kd (4 14a) 

Time ^ (p) = 13 5/fc+(a+b)(2f<+2/-l)]d 2 +0 5/kd (4 14b) 

The associated speedups and efficiencies are given by <4 15a) and 
(4.15b) 

Speedup <p> = <35W-2r-2ctl)a 2 t0 5W 
(3 5//( +2fc +2/ -l)d 2 tO 5ikd 

Efficiency. <p) = 

' (3 5A/tp(2/c+2/-l))d 2 t0 SNd 


(4.15a) 
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Speedup ^(p) = 
Efflclency 2 <p) - 


(3.5A/-2r-2ctl)d 2 t0.5/Vd 
[3 5//c+(a fb ) (2k +2/ -l)]d 2 +0 5 jkd 

(3.5/V-2r-2ct1)d 2 t0.5/Vd 
[3.5 N +p (a t£> ) (2k +2/ -1 )]d 2 +0.5A/d 


(4.15b) 


The values of the total assembly time, the time due to duplication, 
the speedup, and the efficiency are given for Policy 1 in Table 3. for 
the particular p. /. and k values corresponding to Figures 2b, 2c, and 
2d. 



£ 

/ 

k 

Total 

Duplication 

Speedup 

Efficiency 


l 

6 

6 

448 

0 

— 

— 


4 

3 

3 

135 

12 

3.32 

83% 


6 

2 

3 

110 

28 

4.07 

68% 


12 

1 

3 

65 

24 

6.89 

57% 



Table 3 

i . Assembly Times for 

Figure 2 







(Policy 1. ) 



Equations 

(4.9) 

and 

(4 13) 

show that the 

duplication is a decreasing 

function 

of 

the 

number of 

processors when 

p >4 but 

the efficiency also 

decreases 

since the 

duplication comprises 

a larger 

percentage of the 

parallel 

time. 

This is 

also 

seen from Table 

3. 


Results 

on 

a 2x2 

FEM 

for the 36 node 

plane stress problem with 2 


equations per node and 9 nodes per processor (illustrated in Figure 2b) 
show a speedup of 3.2 over the corresponding uniprocessor algorithm for 
Policy 1. This number compares quite well with the value of 3.32 in 
Table 3 

The values of the sequential assembly time, the parallel assembly 
time, the time due to communication overhead, the speedup, and the 
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efficiency are given for Policy 2 in Tables 4a. 4b. and 4c for the par- 
ticular p. i. and k values corresponding to Figures 2b. 2c. and 2d 


£ 

/ 

k 

Total 

Overhead 

Speedup 

Efficiency 

l 

6 

6 

448 

0 

— 

— 

4 

3 

3 

167 

44 

2.68 

67% 

6 

2 

3 

138 

56 

3.25 

54% 

12 

1 

3 

89 

48 

5.03 

42% 


Table 4a. Assembly Times for Figure 2. 

a =1 :6 =1 (Policy 2) 


£ 

/ 

k 

Total 

Overhead 

Speedup 

Efficiency 

1 

6 

6 

448 

0 

— 

— 

4 

3 

3 

145 

22 

3.09 

77% 

6 

2 

3 

110 

28 

4.07 

68% 

12 

1 

3 

65 

24 

6.89 

57% 


Table 

4b. Assembly Times for Figure 
a =0 5.6 =0 5 (Policy 2) 

2 . 


£ 

/ 

k 

Total 

Overhead 

Speedup 

Efficiency 

1 

6 

6 

448 

0 


— 

4 

3 

3 

134 

11 

3.34 

84% 

6 

2 

3 

96 

14 

4.67 

78% 

L2 

1 

3 

53 

12 

8.45 

70% 


Table 4c . Assembly Times for Figure 2 . 

a =0.25.6=0.25 (Policy 2) 
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The values in Tables 3 and 4a, 4b. and 4c are for either a 2x2 or 
a p/2 x 2 processor grid Equations (4.13), (4 14), and (4 15) were 
used to predict the corresponding times for a plate with 768 nodes 
arranged in 16 rows and 48 columns in order to to investigate the effect 


of completely 

interior 

processors 

when 

p >4. 

The results are given 

Table 5 for 

Policy 

1 

and 

Table 

6a, 6b. 

6c. 

for Policy 2 

£ 

/ 

k 


Total 

Overhead 

Speedup Efficiency 

l 

16 

48 


11012 

0 




4 

8 

24 


2880 

32 


3.82 96% 

16 

4 

12 


844 

124 


13.05 82% 

64 

2 

6 


240 

60 


45.88 72% 

256 

1 

3 


73 

28 


150.85 59% 


Table 

5. 

Assembly Times for 16x48 Plate 
(Policy 1) 


£ 

/ 

k 

Total 

Overhead 

Speedup 

Efficiency 

1 

16 

48 

11012 

0 

— 

— 

4 

8 

24 

3100 

252 

3.55 

89% 

16 

4 

12 

968 

248 

11.38 

71% 

64 

2 

6 

300 

120 

36.71 

57% 

56 

1 

3 

101 

56 

109.03 

43% 


Assembly Times for 16x48 Plate 
a = 1 ;b = 1 (Policy 2) 


Table 6a. 
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£ 

l 

/ 

16 

k 

48 

Total 

11012 

Overhead 

0 

Speedup 

Efficiency 

4 

8 

24 

2972 

126 

3.70 

93% 

16 

4 

12 

844 

124 

13.05 

82% 

64 

2 

6 

240 

60 

45.88 

72% 

256 

1 

3 

73 

28 

150.85 

59% 


Table 6b. 

Assembly Times for 16x48 
a =0 5:b =0 5 (Policy 2) 

Plate 


£ 

/ 

k 

Total 

Overhead 

Speedup 

Efficiency 

1 

16 

48 

11012 

0 

— 

— 

4 

8 

24 

2911 

63 

3.78 

95% 

16 

4 

12 

782 

62 

14.08 

88% 

64 

2 

6 

210 

30 

52.44 

82% 

56 

1 

3 

59 

14 

186.00 

73% 


Table 6c. Assembly Times for 16x48 Plate 
a=0 . 25 ;b**0 . 25 (Policy 2) 

The results in Tables 3. 4. 5. and 6 show that for values of a and 
b below 0 5 the best policy for assembling the stiffness matrix K on an 
array of p >4 processors will be Policy 2. that is. communication between 
the processors is warranted. For the special case of 4 processors, the 

values of a and b must be lower than 0 25 before Policy 2 is recom- 
mended The conditions that must be satisfied for Policy 2 to be the 

best policy are easily found from equations 4 6a and 4 6b. 4 10a and 

4 10b. 4 14a and 4 14b for the cases of a 2x2 processor grid, a 
p/2 x 2 grid, and a Vp" grid respectively These conditions 



are given in (4 16) below. 
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(4 16) 


a(2A+/-1)+/(b-1) < 0 for p = 4 

(a +0 ) < 1 for p>4 

When the problem of Figure 2a. is solved with 4 processors, /= 3 and 

k=3 so that the conditions in (4.16) become the following: 

8a + 3b < 3 for p= 4 

(4.17) 

(a+b) < 1 for p >4 

For the problem of the 16x48 plate. /= 8 and k=24 when 4 processors 
are used so that the conditions in (4.16) become the following: 

55a t 8b < 8 for p= 4 

(4.18) 

(a+b) < 1 for p >4 

Recall that a and b are the number of K matrix coefficient calculations 
that comprise the time to send and receive a value between processors 
respectively. Equation (4.16) shows that Policy 2 is more likely to be the 
optimal policy when the values of a and b are small The values of a 
and b will decrease for two reasons. First, if the communication 
between processors is made faster a and b will necessarily be less. 
Secondly, if higher order elements or more complicated integration rules 
are used the time to calculate one coefficient will increase which will in 
effect make a and b less For these situations, the assembly process 
should include communication between processors. 


4.3. Parallel Stress Calculation 

The purpose of this section Is to describe the stress calculation, 
and demonstrate that it can be made with no communication between 
processors. 
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After the system of displacement equations has been solved (the solution 
algorithms will be discussed in Chapters 5. and 6.). the displacements at 
the nodal points In processor P's CONNECTED_TO data structure are in 
processor P's local memory since these values were either calculated by 
processor P or were passed to it during execution. Hence, the nodal 

displacement values on the elements in processor P's element table are 
resident in processor P's memory. As an illustration, consider the pro- 
cessor assignment In Figure 5. 



Figure 4. Processor Assignment 

Displacements at the nodes 2. 3, 4. and 5 are in processor P's local 
memory after the displacement calculation is complete. Likewise, the 
same values are in processor Q's memory. 

For the case of linear basis functions on the triangular elements, 
the stresses are constant across the triangles. The obvious question is 
whether processor P or Q should calculate the stresses on a given tri- 
angle. Define as the first node of the triangles in Figure 5 the node 
associated with the right angle and then number the remaining two 
nodes in a counterclockwise fashion. A good rule would be to require 
the processor that has the first node of the element as an interior node 
to calculate the stresses on that element since this will require no 
duplication of effort. For example, processor P calculates stresses on 
element 1-3-2 and element 3-5-4 whereas processor Q calculates 
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stresses on element 6-4-5 and element 4-2-3. 

Recall from Chapter 2 that the actual stress calculation, in the 
linear element case, involves the pre-multiplication of the element's nodal 
displacement vector by a stress matrix that is a function of the coordi- 
nates of the element's nodal points. If the coordinate information in 
Table 2. is available to the stress procedure, this matrix is rapidly cal- 
culated. 

Stress calculation results for a 36 node plane stress problem (50 
elements) run on a 2x2 FEM showed a speedup of 4 over a uniproces- 
sor version for the stress calculation. These results indicate that a 
speedup of 0(p) can be expected when p processors are used to cal- 
culate the stresses. This perfect speedup is a consequence of the 
absence of both communication between processors and redundant calcu- 
lations. 

The use of higher order basis functions will not produce constant 
stresses over an element, but the stresses can be calculated from the 
element's nodal displacements and nodal coordinate values without any 
duplication of effort among the processors Hence, for these basis func- 
tions. 0 (p) speedup is also predicted 



CHAPTER 5 


Parallel Linear Stationary Iterative Methods 

In this chapter we consider the implementation of linear stationary 
iterative methods for the solution of 

Ksi=L (5.1) 

on both vector computers and parallel arrays. For concreteness, we will 
use the CYBER 203/205 as an example of the former and NASA 
Langley's Finite Element Machine as an example of the latter. The 
implementation of Jacobi's method Is discussed in Section 5.1, the 
description and Implementation of a new method. Multi-color SOR. is 
given m Section 5.2. a Multi-color SSOR method Is discussed in Section 
5 3. and Implementation considerations for block Iterative methods is 
addressed in Section 5 4 

5.1. The Jacobi Iterative Method 


Let the matrix K with elements k be split as 


K = 

D-l-U 


(5 2) 

where D is the diagonal part of 

K and -L and -U are 

the 

strictly 

lower and upper triangular parts 

of K respectively Then 

the 

Jacobi 

iterative method for solving (5.1) is 

given by 



D^ +1 = 

(L*U)u. k + L 


(5 3) 

or 




k + 1 
iL 

= Bsl + £. 


(5 4) 


where 
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£= 0 _1 1 


B - O* 1 (L+0) 


(5.5) 


and the matrix 8 is called the Jacobi iteration matrix. The conditions 
for the iteration (5.4) to converge are given below (Youngn97U). 


Jacobi Convergence Theorem 

Let K be a real symmetric positive definite matrix. Then the 
Jacobi iteration converges if and only if O+L+L/ is positive 
definite. 


This theorem shows that the Jacobi method is not guaranteed to con- 
verge for all symmetric and positive definite matrices K such as those 
arising from finite element discretizations as discussed in Chapter 2 

Closely related to the Jacobi method is the simultaneous overrelaxa- 
tion method (JOR method) defined by 

fc + 1 _ k .. 

<L = flu + uc (5.6) 

where 

8 = uB + (1-cj)/ (5.7) 

u 

and B is called the JOR iteration matrix. 

GJ 

The conditions for this iteration to converge are given below (Young 
[1971]): 

JOR Convergence Theorem 

Let K be a real symmetric positive definite matrix Then the 
JOR iteration converges if and only If 2u 1 D-K is positive 
definite. The condition that 2u ^ D-K is positive definite may 
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2 

be replaced by 0 <u><y^ <2 where u m | n ^0 is the smallest 

min 

eigenvalue of B. 

This theorem Implies that by appropriately choosing u>, the JOR method 
can be made to converge If t/ mjn <0. However, this choice of w 
depends on knowledge of the smallest eigenvalue of S . In fact, Hayashi 
and Yokomana [1977] report that JOR diverged for finite element discreti- 
zations of typical structural problems such as cantilevered beams and 
simply supported plates unless the relaxation parameter u> was carefully 
chosen. Hence, for problems of interest to us, the JOR or Jacobi 

methods are not suitable because the eigenvalues of B are rarely known 

2 

in advance and convergence Is not guaranteed for <oj<2. How- 

min 

ever, the Implementation of these methods on vector computers or paral- 
lel arrays Is of interest to us since these methods may successfully be 
used as preconditioners for the conjugate gradient method as discussed 
In Chapter 6 or in the Implementation of an SOR method as will be dis- 
cussed In Section 5.2. 

We now describe how Jacobi's method can be implemented on a 
parallel computer For concreteness, we consider an elliptic equation of 
the form 

u + au + u - 1 (5.8) 

xx xy yy 

on the unit square with Dirichlet boundary conditions where a is a given 
constant and I Is a given function of x and y. We discretize (5.8) with 
the usual second-order finite difference approximations (see, e g , For- 
sythe and Wasow [I960]) which give the difference equations 
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U,., ,+U. , ,+U, ^,+u. . , -4 u., 

/ + !,/ /-!./ /./+1 /./-l if 


+ f IU / + l./+1 _ ‘'/-l./ + l + Vl./-l _U / +1./-1 1 = 1 ij 


(5.9) 


where /> is the spacing between grid points, /,/= 1,2 A/, /i(A/+l)=l, 

denotes the approximate solution at the i.jth grid point, and t Uh .ih). 

Now. the Jacobi method (5.4) for (5.9) can be written in the form 
used for implementation as 


(k+ 1) 1, (k) tt) ^ Ik) t k ) ,1.2, 


a, (*) Of) Al .Ot) „Of) , 
■ | T 6 ru /+ i./+i ""/- 1./+1 /- 1./-1 /+ 1./+1 


(5.10) 






First, we consider the implementation of (5.10) on the CDC CYBER 

203/205 where vectors consist of contiguous storage locations and the 

efficiency of the vector operations is strongly dependent on vector length 

with the maximum efficiency achieved for very long vectors For vectors 

of length 1000 around 90% efficiency Is obtained, but this drops to 

approximately 50% or less for vectors of length 100 and less than 10% 

for vectors of length 10 Hence, we would like to keep vector lengths 

on the order of 1000 or more whenever possible Now for (5 9) sup- 

2 4 

pose that h=01 so that N=99 and n=/V -10 If we consider the boun- 

dary values of the square region to be unknowns and order the grid 

points, including the boundary points, from left to right, bottom to top, 

2 

the unknown vector n in (5.4) will have length OV+2) and the new 
k i-l 

iterate u can be completely vectorized as a matrix vector product fol- 
lowed by the addition of two vectors Also note that the relaxation 

parameter in (5 6) causes no problem and hence the JOR method is 
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implemented in the same fashion However, the boundary values must 

not be changed by the iteration and this is prevented by use of the 
control vector feature on the CYBER 203/205 which allows suppression of 
storage of updated values into the boundary locations. (See. e.g. Lam- 
blotte [1975] or Ortega and Voigt [1977] for more details on this pro- 
cedure.) Since the calculation of new values corresponding to the boun- 
dary points Is superfluous, an inefficiency of approximately 4% for N=99 

is Introduced; however, almost full efficiency of the vector operations 
results. 

Next, we describe the implementation of (5.10) on the Finite Element 
Machine. Now. the grid point stencil for (5 9) is given in Figure 1. 



Figure 1. Stencil for (5 9) 


and matches exactly the eight local neighbor connections of the FEM 
that was discussed in Chapter 3. Hence, if we have as many FEM pro- 
cessors as the N interior grid points, each interior point and its associ- 
ated row of K matrix coefficients and ]_ vector component could be 
assigned to one processor. The boundary nodes would not be assigned 
to processors, but Instead their values are stored in the processors 
which need them. The data communication between processors can be 
done completely with the local communication links and the convergence 
flag In all processors is checked by the signal flag network Let u 

and denote the portions of il that are assigned to processor p and 
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the logical neighbors of processor p respectively; then, the algorithm that 
Is executed in each central processor is given below: 


For *=1.2 -> max «o 

(1) Solve for ■ 

If 4.1 

(2) Send iLp to the logical neighbors 

via the local links and global bus if needed. 

(3) If JJ “ £p || „ < € raise the convergence flag. 

(4) If the convergence flag is raised in all processors then stop 
else continue. 


(5) Receive n n from the logical neighbor 

processors via the local links and global bus if needed. 

Algorithm 1. Parallel Jacobi (One pomt/processor) 

However, In practice it will most certainly be the case that the 
number of Interior grid points N will greatly exceed the number of pro- 
cessors p. For this situation, we simply assign [ N/p] points .per pro- 
cessor in such a way as to take advantage of the local links of FEM 
For example, suppose that N=4p. Then we assign the grid points to 
the processors as shown In Figure 2. 
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Figure 2. Processor Assignment for Jacobi's Method 

and note that each processor must be connected only to its eight 
nearest neighbors since each point is connected to its eight nearest grid 
points as shown in Figure 1. Hence, only the local communication links 
will be required for communication. Algorithm 1 is then modified as fol- 
lows. 


For k = '\.2....k 


max 


do 


( 1 ) 

( 2 ) 


Solve for each component of si 


k + 1 


irr sequence 


k tl 

Send the necessary components of u. to the local neighbors 
via the local links (Only local links *are needed for the stencil 
of Figure 1) 


(3) If 



< € raise the convergence flag. 


(4) If the convergence flag Is raised in all processors then stop 
else continue. 

k + i 

(5) Receive u from the logical neighbor processors via the local 
links. (Omy 1 local links are needed for the stencil of Figure 1) 


Algorithm 2 Parallel Jacobi (Multiple points/processor) 
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For processors on the border of the processor array, values will not 
be sent to and received from all eight of the neighbor processors and 
consequently, the algorithms In these processors may be different to 
reflect this; or alternatively, the same algorithm could be used in all 

processors with a test included to determine a processor's position in 

the array. A third option would be to maintain the same algorithm in 
each processor and provide each processor with the appropriate lists of 
unknowns and associated processors to send data to and receive data 
from. This was the approach taken in Chapter 4 where the connectivity 
of the grid points was determined by the assembly process If this con- 
nectivity data were coupled with an algorithm that assigns the points to 

the processors, the appropriate information for communication would be 
available to each processor and the algorithms in ail processors would 
be the same 

Algorithm 2 will allow each processor to run without waiting on other 

processors with the exception of the synchronization in step (5) and the 

convergence test In (4). Because the processors may complete the 
k tl 

updating of a In different times due to a number of factors, slightly 
different clock times In the processors, different memory access times, 
especially for those processors connected to the boundary, different 
number of unknowns per processor if p does not evenly divide N. syn- 
chronization of the processors to some degree is realized by the syn- 
chronous RECEIVE command which causes processors to wait until the 
value to be received is available before computation continues This, in 

effect, allows the slower processors to catch up and also ensures that 
the same answer will be obtained for the problem on the processor 
array as on a single processor Note that the processors are not 
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required to operate In a SIMD or lockstep fashion, but the information 
for the next Iteration must be obtained from neighbor processors before 
the iteration continues. If we relax this requirement and use an asyn- 
chronous RECEIVE, the processors may run asynchronously and the delay 
times will be reduced. The numerical iterates will however deviate from 
the true mathematical Iteration but Baudet [1978] shows that this may be 
beneficial 

The second source of delay is the convergence test for the iterative 
process. The local convergence test in (3) of Algorithm 2 can be done 
in all processors simultaneously and therefore incurs no delay. However, 
at the end of each iteration, the convergence flag must be checked in 
all processors as indicated by (4) of Algorithm 2. If all the flags are 
not set. the processor continues with the next iteration. Hence, the 

entire process will not terminate until all unknowns have satisfied the 

convergence criterion and towards the end of the process a portion of 
the processors may be doing unnecessary work. This seems to be an 

unavoidable inefficiency. 

in the absence of these delays, if p evenly divides N. and if the 

processors operate at the same speed, the Jacobi method on the Finite 
Element Machine will have speedup O(p) The actual speedup, however, 
will be a function of the ratio of the communication to calculation times 
of the processors and is discussed in Chapter 7. 
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5.2. The Multi-Color SOR Method 


5.2.1. Motivation 

Let the matrix K. be split as given by (5.2). Then the SOR Iteration 
applied to (5.1) is given by 

— (D -UiL )iL +1 = l{uiU+a-u))D]iL+L (5.11) 

w w 


or 


k + 1 


■ L J‘ + £ 


(5.12) 


where 


c=(jj(.D-wL ) 



(5.13) 


L =(D-utL)~^(uU+Ci-ut)D) 
uj 

is called the SOR iteration matrix and u) is the relaxation parameter 
chosen to enhance convergence. 


The conditions for (5.11) to converge for symmetric matrices with 
positive diagonal elements is given by the Ostrowski-Reich Theorem 
(Vargatl962}). 


Ostrowski - Reich Theorem 

Let K be a symmetric matrix with positive diagonal elements. 
Then the SOR method converges if and only if K is positive 
definite and 0<u<2. 


Since the problems of interest to us are symmetric and positive definite. 
SOR is guaranteed to converge if we choose 0 <cj<2 
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The SOR iteration (5.12) can be written for implementation as 


ktl ... , k 

Uj = (1 


il 


I - 1 

- E K 

/ = i 


/f+i 

n u i 


rt 

- E 

/=»+! 


. k, 
k u J 
il I 


(5.14) 


This form shows that the SOR iteration is sequential in nature since the 

values of u . /= 1.2 /-I must be computed before u ( on iteration k+ 1. 

This was not true for the Jacobi Iteration of (5.10) where only previously 
computed values were required for the update of a given component of 
U.. Despite this sequential nature, several authors (e.g. Hayesll974], 
Lambiotte[1975]) have observed that if (5.1) arises from a five-point 
difference discretization of Poisson's equation and the equations are 
ordered according to the classical Red/Black partitioning of the grid 
points then an SOR sweep may be carried out. in essense. by two 
Jacobi sweeps, one on the equations corresponding to the red points 
and one for the equations corresponding to the black points. Thus, in 
this case, the SOR method can be effectively implemented on vector or 
parallel computers. 

On vector computers, all the unknowns associated with the red grid 
points would be combined into one long vector and similarly for the 
unknowns associated with the black grid points For parallel arrays, an 
equal number of red and black equations would be assigned to each 
processor. The SOR Iteration would be comprised of two Jacobi sweeps, 
one Red sweep followed by one Black sweep with each sweep performed 
simultaneously by the processors. After each sweep, the updated values 
of the respective color would be communicated between processors 
After the Black sweep, and hence one SOR iteration, the convergence 
test would be performed as described in the last section 
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This strategy does not work, however, for higher order finite differ- 
ence or for finite element discretizations for more general elliptic equa- 
tions which contain cross partial derivative terms. In these cases. It Is 
necessary to generalize the Red/Black partitioning of the grid points to a 
"Multi-color" partitioning; for example, a three color partitioning, say 
Red/Black/Green, might give the desired result. In general, the number 
of colors necessary will depend on the connectivity pattern of the grid 
points, if p colors are used, an SOR sweep can be implemented by p 

Jacobi sweeps, one for each set of equations assoicated with a given 
color For vector computers, this reduces the effective vector length to 
Oin/p ) while for parallel arrays it is necessary that each processor hold 
a multiple of p equations where this multiple will be determined by the 
particular discretization. Clearly, there will be a point of diminishing 
returns as p increases, but for most differential equations and discretiza- 
tions of Interest it seems that no more than 6 colors will suffice and for 
the size of n we have in mind (n ® 10,000 t), the Multi-color strategy 
can be very effective. 

We note that the Multi-color orderings for SOR have been used 
before (Young[19711. Hackbush[1977], Hotovy and Dlckson[1979]) but not in 
context of parallel computation for finite element discretizations. 

In the next section, we describe the method in more detail and give 
the appropriate coloring (ordering) of the grid points for several finite 
difference and finite element discretizations and discuss how to implement 
the resulting Multi-color SOR method on parallel computers. In Section 
5 2 3 we compare the Multi-color SOR method to existing theory, and in 
Section 5 2.4 we give numerical comparisons of the Multi-color ordering 
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with the lexlographical (rowwise) ordering of the grid points. 

5.2.2. Multi-Color Orderings 

As a first example, we consider the elliptic equation (5.8) that is 
discretized as given in (5.9) and partition the grid points by the 
Red/Black scheme as shown in Figure 3. We then number the Red grid 
points from left to right, bottom to top followed by the Black grid points 
in the same fashion. 


° R 
° B 
° R 
° B 


° B 
8 R 
8 B 
8 R 


8 R 
8 B 
8 R 
8 B 


8 B 
8 R 
8 B 
8 R 


Figure 3. Red/Black Ordering 


Now if a=0. so that (5 8) is just Poisson's equation, then (5 9) represents 
the usual five-star discretization of (5.8). It is well-known (see e.g. 
Young [1971]) that the difference equations (5.9) may be written In the 
partitioned matrix form 

“r 

where D is a diagonal matrix and ^ and denote the vectors of 

unknowns associated with the red and black grid points respectively. 
The Qauss-Seldel iteration for (5.15) may be written as 


Ar 


(5.15) 
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_ kt 1 „ k ^ . 

Diir = ~ B % + K 

(5.16) 

d 4 +1 = - e \* +1 + ^ 

and each part of (5.16) can then be effectively implemented in a parallel 
fashion, with the introduction of the SOR parameter causing no problem. 

If a* 0, the form (5.15) of the difference equations is still valid 
although D is no longer a diagonal matrix. Hence, the unknowns 
corresponding to the red points are coupled to each other in (5.16) and 
likewise for the black points: whereas for a=0 they completely uncouple. 
The result is that (5.16) is no longer implementable in a parallel fashion. 
This is illustrated by the grid point stencil for (5.9) with the Red/Black 
ordering as shown in Figure 4. 



Figure 4 . Stencil for (5.9) 


The center Red point can be seen from Figure 4 to be connected to 
the Red points at the four corners, and a similar stencil is obtained for 
the Black center points. 

We wish to introduce another partitioning of the grid points for 
which unknowns within each subset of the partitioning are uncoupled. 
This is possible only if the graph associated with the discretized domain 
can be colored with p colors so that nodes of a given color have no 
edge between them A graph with this property will be called p-partite 
which is formally defined below 
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Definition 2 

A graph G(V.E) with a set of vertices V and edges E is p- 

partlte if its vertices form p disjoint subsets . S 2 S p . 

P 

with u S.-V such that If uveE(G) then ueS, and v eS for 
some / +i . 


Examples of a bi-partite and a 3-partite graph are given below: 




Definition 1 requires that nodes within the same subset are not con- 
nected by an edge: however, no restrictions are made on the number of 
nodes in a subset S. that can have edges to nodes in subset S . In 

I P\ 

fact, all I g I pairs of subsets could be connected by an edge from any 
node in one subset to any node in another In this case, a p-partite 


graph would consist of 



bi-partite graphs. 


For example, for the stencil of Figure 4. if we use four colors, we 
can partition the grid points into four subsets labeled Red. Black. White. 
Orange so that each center point connects with only points of different 
colors. A suitable coloring for the stencil in Figure 4 is illustrated in 
Figure 5. We note that the coloring repeats beyond the given subregion. 
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° R ° B ° R ° B 8 R °B 

°w °o °w °o 8 w °0 

°R ° B 8 R ° B ° R °B 

8 W °0 8 W °0 °W °0 

° R 8 B 8 R °B * R ° B 

8 W °0 8 W 8 0 8 W ° O 

Figure 5. Four color partitioning of the gndpoints 

In this case, the system (5.9) can be written in a partitioned form 
analogous to (5.15) as 

D 1 S 12 S 13 S 14 u r 

fl 21 °2 a 23 8 24 ^ 

S 31 S 32 °3 S 34 "w 

S 41 fl 42 S 43 °4 

where D 1 . D 2< Dg. and D 4 are diagonal matrices. The Qauss-Seidel 
iteration in terms of (5.17) is then 

°1^ +1 " ^lA ‘ 8 13^w * 8 lA + A- 
°2S +1 = -S 21~r +1 ” S 23“i ’ 8 2 4^o + ^ 

If ^ 

with similar equations for si^ and n Q 

Now. since the D / are diagonal. (5.18) is easily implementable on 
parallel architectures by taking four sweeps of Jacobi's method to 
comprise one Gauss-Siedel iteration, in particular, for vector computers, 
the vectors are of length Oin/ 4) and the update of the vectors j^. u ^ . 
and n Q must occur in sequence with each vector update being fully 
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vectorized into matrix-vector multiplications and vector additions For 
parallel arrays, the grid points must be partitioned into subsets and each 
subset assigned to a processor. The primary goal of this assignment 
for a machine such as the Finite Element Machine, or on a similar 
array with perhaps many more processors but limited processor to pro- 
cessor interconnections. Is to keep as many processors as possible run- 
ning at a given time. This, in turn, requires maximum use of the pro- 
cessor interconnections and minimum use of the global bus since con- 
tention for the bus will tend to introduce delays which cause processors 
to be idle. 

This objective can be achieved by ensuring that each processor 
holds at least as many unknowns as a certain multiple of the number of 
colors where the multiple is the number of rows above the center point 
In the grldpomt interconnection stencil. Also, we would like to ensure 
as much as possible that all processors hold the same number of each 
color of grid points, thereby increasing the likelihood that all processors 
finish each Jacobi sweep on a particular color at the same time so as 
to reduce delays in data communication between sweeps. 

Figure 6 shows the assignment of the grid points of Figure 5 to the 


processors. 
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Figure 6. Processor Assignment for (5.19) 


Each processor in Figure 6 holds an equal number of Red, Black. 
Orange, and White points. If fewer processors are available, we can 
assign a 2kx2l block (instead of a 2x2 block) of points to each proces- 
sor since the same number of each color of points will be in any two 
disjoint blocks. During the solution of (5 18). the processors In the inte- 
rior of the processor array communicate with all their eight compass 
point neighbors as can be seen from Figure 6 above and the grid point 
stencil In Figure 1. On the Finite Element Machine this communication 
can be done via the local communication links and no use of the global 
bus will be necessary Each boundary processor will communicate with 
fewer than eight neighbor processors, the exact number depending on its 
location. 

Let p and sl c n denote the portion of nodes of color c assigned 
to processor p and the portion of nodes of color c that are needed by 
processor p for the calculation of u but reside In other (perhaps 

neighbor) processors respectively. The Multi-color SOR algorithm that is 
executed by processor p is given below: 
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For k=1 ' 2 k max 60 

(1) For c=1.2...,nc do 


... „ , . /c +1 

(1) Solve for sl c p 

k + 1 

(2) Send necessary portion of u to logical neighbors. 

c #p 

(3) Receive u from logical neighbors. 

c ,n 


(2) If 


k + 1 
L c.p 



< e set the convergence flag. 


(3) If all processors have convergence flag set then stop. 
Algorithm 3. Multi-color SOR 


We now give the coloring of the grid points and the associated pro- 
cessor assignment for some common finite difference and finite element 
discretizations. First, consider the nine-point discretization illustrated by 
the stencil in Figure 7 



Figure 7. 9-poxnt Discretization 


The grid points are partitioned into three subsets by using the three 
colors Red. Black, and Green as shown in Figure 8. 
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# B ° G 8 R 8 B ° G 8 R • B 8 G 8 R 

•g 8 r # b 8 g °r °b 8 g °r 8 b 

° R 8 B * G * R 8 B °G °R 8 8 °G 
°B 8 G 8 R 8 B 8 G 8 R 8 B 8 G 8 R 

8 G 8 R 8 B 8 G 8 R 8 B 8 G 8 R 8 B 

Figure 8. 3 -Coloring for Figure 7. 


The points can be assigned to processors in blocks of size tor 3/ with a 
minimum block size of lx 3 as shown in Figure 9. 



Figure 9 . Processor Assignment for Figure 8 . 


With this assignment the North. South. East, and West local links of FEM 
can be used but the global bus is needed to communicate values to the 
next North, next South, next East, and next West neighbor processors. If 
the blocks were instead sized with fc>1. only the eight local communica- 
tion links are required. 

Secondly, consider the thirteen-pomt discretization that is often used 
for the bi-harmonic equation and is illustrated by the grid point stencil 
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in Figure 10. 



Figure 10. 13-point Discretization 


The grid points are partitioned into six disconnected subsets by the use 
of the six colors Red. Black. White. Orange. Yellow, and Purple as shown 
in Figure 11. 
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Figure 11. 6-Coloring for Figure 10. 


In order to maintain the same number of each color in two distinct pro- 
cessors. the points must be assigned in blocks of 2Jcx3 / with the 

minimum block size being 2x3 as shown in Figure 12. Note that only 

the eight local communication links of FEM are required. 
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Figure 12. Processor Assignment for Figure 11. 


We now consider rectangular domains that have been discretized by 
finite elements. Triangular elements with associated piecewise continuous 
(C°) linear basis functions defined at the three vertices and their asso- 
ciated gridpoint stencil are shown in Figure 13. 



Figure 13. Linear Triangular Element and Grid Point Stencil 


The center point of the stencil is connected to the six points that share 
a common triangle. For this discretization, the grid points can be parti- 
tioned into three disconnected subsets by using the colors Red. Black, 
and Green as shown in Figure 14. 
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The grid points are then assigned to processors in blocks of size kx3l 



The local communication links of each FEM processor that are needed 
for this assignment are the North. South. East. West. Northwest, and 
Southeast links. 

Next, consider a triangular element with C° -piecewise quadratic 
basis functions defined at the vertices and midpoints of the triangle 
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This element and its associated grid point stencil are illustrated in Fig- 
ure 16. 



Quadratic Element Stencil for Nodes 1,2, and 3 


O 



Stencil for Node 4 Stencil for Node 5 stencil for Node 6 
Figure 16. Quadratic Element and Grid Point Stencils 


For this stencil, the gridpoints may be partitioned Into six disjoint subsets 
with the colors Red. Black. Green. Orange. Yellow, and Purple as shown 
in Figure 17. 
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Figure 17. 6-Coloring for 


Quadratic Elements 


The grid points are then assigned to processors in blocks of size 2 kx6i 
with the minimum block of size 2x6 as shown in Figure 18 Ail eight 
local links are required for this processor assignment. 



Figure 18 . Processor Assignment for Figure 17 . 


We now consider two examples of higher order finite elements that 
are used to discretize 4th order partial differential equations. The first 
example is the (function and its first partials are continuous) bi-cubic 
rectangle (see Becker and Oden[1981]> A cubic in x and y can be 
uniquely 
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determined by 16 constants. Therefore. If we prescribe the values of 

du h du h 

the unknown at a grid point. t/ ft . Its partials in x and y. and 

JZ h 

and Its second partial — — at the four corners of the rectangular ele- 
ment as shown In Figure 19, 



Figure 19. Bi-Cubic Rectangle and Grid Point Stencil 


the basis functions at each grid point will be bi-cubic polynomials which 

3 h 

will have continuous partials across element boundaries where n is 

the normal to a common side. The stencil in Figure 19 is the same 

stencil as the stencil of Figure 4. therefore Figures 5 and 6 give the 

appropriate coloring and processor assignment for grids that are discre- 
tized by this element. 

Lastly, we consider the C 1 qulntlc triangle. Oden [19811. which is 

shown In Figure 20. 



Figure 20. Qulntlc Triangle 
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du h 

each vertex of the triangle and the value of the normal derivative at 
the midpoints of each side of the triangle. The nodes In Figure 20 
have the same connectivity as those of the C° quadratic triangular ele- 
ment and the stencils of Figure 16. Therefore. Figures 17 and 18 give 
the appropriate coloring and processor assignment for grids that are 
discretized by this element. 

5.2.3. Comparison to Existing Theory 

In this section we explain what is meant by a p-Colored matrix and 
show how matrices of this type relate to the consistently ordered (CO), 
the q-r consistently ordered (CO(q.r)), and the q-r generally consistently 
ordered (GCO(q.r)) matrices of YoungI1971J and the p-cyclic matrices of 
Varga[1962J. For these matrices of Young and the p-cycllc consistently 
ordered matrices of Varga a well known theory exists for determining the 
optimum relaxation factor u for the associated SOR iterative method. We 
show. In general, that p-Colored matrices do not fit Into any of these 
classifications; however, CO. CO(q.r). and certain p-cycllc matrices can 
easily be permuted Into a p-Colored matrix. 

The notion of a p-Colored matrix is directly related to that of a p- 
partlte graph as was given by Definition 1. Recall that nodes within the 
same subset of a p-partlte graph are not connected by an edge; how- 
ever. no restrictions are made on the number of nodes in a subset 5 / 
that can have edge connections to nodes In subset Sj. 

By numbering the equations associated with nodes in subset . In 
any order, followed by the equations associated with nodes In subset Sg. 
S_, ... , and finally subset S . the result Is a p-Colored matrix K which 

O p 

has the following form: 
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°11 

CM 

• x lp 

*21 

CM 

CM 

Q 

• X 2p 

• a 

*pl 

CM 

x a 

• • 

• V 


where the D ^ are block diagonal matrices. 



(5.19) 


with each entry E j being a square matrix representing the equations at 
only one grid point of the associated problem domain that has been 
colored with p colors and nCj nodes of color / 


For the special case of one equation per grid point, say Laplace's 
equation for example, the D . will be diagonal matrices. As was noted 
In Chapter 2. the plane stress problem has two unknowns per grid point; 
consequently for this problem, the £ f will be 2x2 matrices and the 2 
equations at the same node will have the same color. 


We now compare p-Coiored matrices with diagonal D jt blocks to the 
CO(q.r) matrices of Young. Now. a test for determining whether a matrix 
K Is a CO(g.r) matrix is given by the following definitions and theorem 
by Young. 


Definition 2 . ( Young) 

For given positive Integers q and r. the matrix K of order N 
is a (g ,r)-conslstently ordered matrix (CO(g ,r)-matrix) if for 
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» 

! 


some t. there exists disjoint subsets S r S 2 ,*-*,S f of 

W a {l.2 /v} such that £ S. =W and such that: If k..+Q and 

A = 1 11 

/</. then ••• +s f _ f and icS k+r . where is the 

subset containing /; if *^#0 and />/. then 

/€S g +1 +s < 7+ 2 + +s f an<1 /€S /f- q wh0re s * is the subset 

containing /. 

Definition 3. (Young) 

The vector ?=<? r y 2 , * * ’>V n )T - where ./=1,2 N are 

integers. Is a (g.r) compatible ordering vector for K if for any 
I and / such that *^#0 then 


y i ~ y l s ~ q lf '*< 

A CO(I.I) matrix Is called CO. or consistently ordered. 

Theorem J_. (Young) 

The matrix K is a CO(q.r) matrix if and only if there exists a 
compatible odering vector for K 


By using Theorem 1 it is very easy to conclude that the p-Coiored 
matrix In (5.19) is In general not a CO(q.r) matrix. In particular, let 



* 0 
* 0 
* 0 


(5.20) 


Then if a compatible ordering vector exists for (5.19). we must have 
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from Theorem 1 that 

7 2 - ^ = r 

y 3 ’ y l = f 
or 

y 3 - 7 2 = 0 (5.21) 

But. since x 32 * 0 ’ we must re( l uire 

7 3 - 7 2 * -q (5.22) 

Since (5.22) conflicts with (5.21). (5.19) is in general not a CO(q.r) 
matrix. The same technique can be used to show that the 4-Colored 

matrix for Figure 6. the 3-Colored matrix for Figure 7, the 6-Colored 

matrix for Figure 11. the 3-Colored matrix for Figure 14. the 6-Colored 

matrix for Figure 17. the 4-Colored matrix for the stencil of Figure 19. 

and the 6-Colored matrix for the stencil of Figure 22 are not CO(q.r) 
matrices. 

On the other hand, if the matrix K is a CO(q.r) matrix, we show in 
the next theorem that K Is permutatlonally similar to a p-Colored matrix. 
Before proving the theorem, we recall the following definitions of Young 
and Varga 

Definition 4 (Young ) 

Given the positive integers q .r . and t. the matrix K is a 

T(g.r.f) matrix If it can be partitioned into the txt block form 

K=(K^) where, for each /. K^=Dj Is a square diagonal matrix 

and where all other blocks vanish except possibly for the blocks 

K A . /=1.2 t-r. and K . . /=d + l.gt2. ..t 

i.i+r i.i-q 


The matrix in (5.23) is an example of a T(l.s.r) matrix. 
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(5.23) 


where the K (/ matrices are diagonal matrices. 

Definition 5. (Varga) 

Let K be partitioned as 

K 11 K 12 * K lp 

K 21 K22 * «2p 

K = (5.24) 

r\ “ • • . 

• • • • 

K P 1 K p2 ’ K pp 

m m 

If the Jacobi matrix B-l-idiagiK )) ^ K is permutatlonaliy similar 
to a T(l.p-I.p) matrix, then K is p-cycltc relative to the parti- 
tioning (5.24). 

Theorem 2. 

Let K be a CO(q.r) matrix and let p'=(g+r)/c/ where d is the 
largest common factor of q and r. Then there exists a per- 
mutation matrix P such that P ^KP is a 


(1) 

p -Colored 

matrix 




(2) 

2-Colored 

matrix if 

t 

P 

is 

even 

(3) 

3-Colored 

matrix if 

» 

P 

is 

odd 


Proof . 

Since K is a CO(q.r) matrix, K has Property A r (see Young). 
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/ 

Therefore K also has Property A. p -1. Now. there exists a 

* 0 

permutation matrix P such that P~^KP is a CO<1.p'-U matrix. 
Furthermore. Young shows that a CO(l.s) matrix is also permu- 
tationally similar to a T(l.s.t) matrix with possibly certain rows 
and corresponding columns of blocks deleted. Now, the adja- 
cency graph associated with the Jacobi matrix for the TCl.s.f) 
matrix of (5.23) is shown below, where we denote ail the vari- 
ables associated with by 1. the variables associated with D 2 
by 2 ,..., and finally those associated with D f by r. 

f — — > — >f-s-l — > >1 


If we color these t blocks with p' colors from right to left as 
C-j/Cg/* • •/C p '/C 1 /C 2 /* • */C p '/C^/ etc. and group together all 
the blocks of the same color and then order the matrix by 
groups, the resulting matrix will have the form 



X , , D, 

p .p -i p 


which is easily seen to be p -Colored as well as p-cycllc and 
(1) follows. 


We next prove statement (2) of the theorem. If s is odd. these 
f blocks can be colored R/8/R/B.. from right to left All the R 
blocks can be grouped together and the same for the B blocks 
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so that the resulting matrix has the form 



which shows that K is 2-Colored. 


If 5 is even, the t blocks are colored R/B/G/R/B/Q. . from right 
to left. Furthermore, if p' is a multiple of three. K has the 
form 


K = 




which is 3-Colorable and also 3-cyclic whereas, if p ' is not a 
multiple of three. K has the form 


K = 




which is not 3-cyclic but is 3-Colored, 
the theorem follows 


Hence, statement (3) of 


We now compare p-Colored matrices to the p-cyclic matrices of 
Varga for the case where the 0.. matrices in (5.19) are diagonal. From 
the form of a T matrix given In (5.23). it Is readily seen that a p- 
Colored matrix Is not in general p-cyclic. On the other hand, if the 
matrix of (5.24) is p-cyclic it is also p-Colored In fact, we can use 
Theorem 1 to show that a p-cyclic matrix is permutationally similar to 
either a 2 or a 3 Colored matrix 
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Corollary L 

Let K be a p-cycllc matrix with the matrices being diagonal 
matrices. Then there exists a permutation matrix P such that 

P~ } KP is a 

(1) 2-Colored matrix If p is even 

(2) 3-Colored matrix if p is odd 

Proof : 

Since K is p-cycllc it is permutationally similar to a T(l,p-l,p) 
matrix. The conclusion follows directly from the proof of 

Theorem 2 after noting that s=p-l and t=p. 

Corollary 1 Implies that p-cyclic matrices for which the diagonal 
blocks are diagonal can be reordered to yield 2 or 3 diagonal blocks on 

the diagonal. This means that for a vector Implementation of the Multi- 
color SOR method, the associated vector sj. for the solution of (5.1) can 

be partitioned Into 2 or 3 long vectors rather than p shorter ones. 

However, we note that the resulting 3-Colored matrix in (2) of Corollary 
1 may not be 3-cycllc and hence no known theory exists to aid in the 
selection of the optimal relaxation factor u. This fact is possibly offset 
by the much longer vectors that will result if p>> 3. Barlow and 

Evans[19821 mentions that p-cycllc matrices may be colored with p 

colors but does not mention the possibility of fewer colors. 

Lastly, we discuss the relationship of p-Colored matrices (again with 
the Dji In (1) being diagonal blocks) to Young's generally consistently 

ordered. QCO(q.r), matrices. First, we give the definition of a GCO(q.r) 


matrix. 
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Definition 6 (Young ) 

A matrix K is a GCO(q,r) matrix if 

det(a*L+a r U-kD ) 

is independent of a for all a#0 and for all k where D.-L.-U 
are the diagonal, strictly lower and strictly upper parts of K 
respectively 

Definition 7. (Young) 

A real matrix K of order N is an L- matrix if 

k. ; >0. / =1 ,2 N 

and 

k t ^<0, i *!. i ,/=l,2 N 

Young also gives the relationship between GCO(q.r) and CO(q,r) matrices 
in the following theorem 

Theorem 3 (Young ) 

If K is an irreducible GCO(q.r) matrix which is an L matrix then 
K is a CO(q,r) matrix. 

Hence, matrices which are both L and GCO(q.r) matrices are permutatio- 
naly similar to either a 2 or a 3-Coiored matrix by Theorem 2 The 
2-Colored matrix will be consistently ordered but the 3-Colored matrix 
may not be q-r consistently ordered as was shown in the proof of 
Theorem 2. 

Since the matrix K is symmetric for our problems, we are interested 
m the relationship of symmetric GCO matrices to CO(q.r) matrices. 
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Lemma 


Proof - 


l 

If K is a symmetric CGO(q.r) matrix then g=r and K is a 
GCO(l.l) matrix. 


Since K is a CGO(q.r) matrix. deUa^L+a r U-kD ) is independent 
of a for ail a#0 and for ail k. Recall that the determinant of 
an NxN matrix is the sum of Nl terms of the form 

f(a)=s(a)/c 1 a(1) *i a(2) * • ' k Ni0iN) (525) 

where 5 (a) is 1 if the sequence a(1),o(2)....o(A/) can be put in 

the form 1,2 N by an even number of interchanges of any 

pair of elements in the sequence and -1 otherwise. 

Now. all the terms that are multiplied by a q r are of the form 

s(o)a q ~ r l-kd.)(-kd„). .k ,...k ...A-kd ) (5 26) 

i dill' n 

where is the ith entry of the diagonal of K and only k^ 
and kj. need to be interchanged for the sequence 

a(Do(2)...a(/V) to be in the order 1.2 N. Hence, all these 

terms have s(o)=-1. In addition, since k s k .. all these terms 
are of the same sign and their sum can only be independent 
of a if a q r is independent of a which is true only if g=r. 
Now. detCa r L+a r U-kD ) can be written as 

det((a r ) 1 L +(a f ) -1 Lf -kD ) 

and is also independent of a for all a#0 and for all k 


Therefore, we conclude that K is a GCO(I.I) matrix. 
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Definition 8. 

A symmetric GCOCl.l) matrix Is an SGCO matrix. 

Next, we give the relationship between SGCO matrices and 2-Colored 
matrices. 

Lemma 2. 

Let K be an irreducible L matrix. If K is an SCGO matrix then 
there exists a permutation matrix P so that P _1 KP is a 2- 
Colored matrix. 


Proof : 

From Theorem 3 it follows that K Is a CO(l.l) or equivalently a 
consistently ordered (CO) matrix. It Is well known that any CO 

matrix can be permuted to the R/B or 2-Colored form and the 

theorem follows. 

The contrapositive of Lemma 2 states that If K is a symmetric L matrix 
that is not consistently ordered it can not be generally consistently 
ordered. This means that we can not simplify the determinant in Defini- 
tion 6 for symmetric L matrices that are not consistently ordered in 
order to relate the eigenvalues of the Jacobi and SOR iteration matrices 
associated with K , and hence determine the optimum relaxation factor w 
for the SOR Iteration method. However, Lemma 2 only gives sufficient 

conditions for an SCGO matrix to be consistently ordered and it remains 
to be determined whether the requirement that K be an L matrix is 


necessary 



85 


5.2.4. Comparison with Rowwise Ordering 

The Multi-color and lexiographical (rowwise) orderings were shown in 
the last section In general not be consistent orderings, therefore, we can 
not conclude that the eigenvalues of the respective SOR matrices are the 
same. The question then arises as to whether one ordering gives faster 
convergence than another However, we note that some degradation in 
the convergence rate of the Multi-color ordering can be permitted since 
It can be implemented effectively on a parallel machine whereas the 
rowwise ordering can not. 

The Multi-color and rowwise orderings were compared experimentally 
for three problems. The first problem was the five-star discretization of 
Laplace's equation on a rectangular grid with 768 unknowns. The results 
for the R/B and rowwise ordering of the grid points are given In Table 
1 Both these orderings are consistent and the results are included 
here for comparison with the next two example problems. 


Iterations 



Red/Black 

Rowwise 

1.00 

470 

542 

1.74 

73 

82 

1.76 

56 

83 

1.80 

65 

85 


Table 1. Laplace's 


Equation 


( 5-Star Discretization ) 


The second problem was a finite element discretization of Laplace's 
equation. The finite elements were triangular with quadratic basts func- 
tions defined at the vertices and midpoints as shown in Figure 16 The 
width of each triangle was taken to be h = 1/12 so that the resulting sys- 
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2 

(23j 

equations. Table 

2 gives the 

results for 

Figure 

17 and the rowwise ordering. 




Iterations 


CJ 

Rowwise 

6-Color 

5-star (I 

1.00 

563 

561 

463 

1.20 

394 

392 

324 

1.40 

266 

264 

218 

1.60 

161 

158 

132 

1.70 

113 

109 

91 

1.76 

83 

76 

64 

1.77 

76 

69 

57 

1.78 

69 

60 

57 

1.79 

62 

54 

59 

1.80 

68 

58 

59 

1.82 

75 

66 

66 

1.84 

83 

73 

75 

1.86 

94 

82 

97 

1.88 

109 

97 

98 

1.90 

117 

121 

117 

1.92 

161 

147 

146 

1.94 

194 

197 

195 

1.96 

291 

302 

293 


Table 2. Laplace ' 3 ^Equation (Quadratic Elements) 
€=10 A23i unknowns 


For this problem, the 6-color ordering and the rowwise ordering for the 
finite element discretization give very similar results and the optimal 
values of cj are the same in both cases In fact, near the optimum 
value of uj both the finite element discretizations require almost the same 
number of iterations as the 5-star finite difference discretization which is 
consistently ordered. 


The third problem was the plane stress problem described in 
Chapter 2. The plate was discretized by linear triangular elements as 
shown in Figure 14 Table 3 gives results for the Red/Black/Green ord- 
ering of Figure 14 and the rowwise ordering of the gridpomts. 
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Iterations 


w 

R/B/G 

Rowwise 

1.4 

349 

347 

1.5 

265 

263 

1.6 

169 

167 

1.61 

153 

152 

1.62 

131 

128 

1.621 

129 

126 

1.622 

127 

124 

1.623 

142 

140 

1.63 

149 

148 

1.64 

147 

145 

1.65 

141 

138 

1.66 

135 

133 

1.67 

156 

154 

1.68 

155 

154 

1.69 

153 

150 

1.7 

150 

148 

1.8 

233 

232 


Table 3. Plane Stress 
€=10-6, 60 unknowns 

Note from Table 3 that the optimum value of u is 1.622 for both order- 
ings. Also, note that the number of iterations for ou> 1.622 behaves dif- 
ferently than was seen from Table 2. For example. Table 2 showed that 
for <jj>u Q p ( the number of iterations was strictly increasing whereas in 
Table 3 the graph of w versus the number of iterations has relative 
minima at u=l 66 for example 


5.3. Multi-Color SSOR 

In this section, we describe a Multi-color SSOR method, give an 
efficient Implementation of this method on vector computers or parallel 
arrays, and give numerical comparisons to an SSOR method without 
multi-coloring for an example problem. 
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5.3.1. Description 

The SSOR iterative method for solving (5 1) can be written as the 
forward SOR iteration followed by the backward SOR iteration 

A+l 

(D-uL)u. 2 * [wU+n-wlDUi* + & 

(5.27) 

k +1 ^*2 

(D -(jjU)il - [wL+(l-w)01ii + tL 

The basic convergence theorem for SSOR iterative method (Young[1971J) 
is stated below. 


SSOR Convergence Theorem 

If K is a symmetric matrix with positive diagonal elements, the 
SSOR method converges if and only if K is positive definite and 
0<u<2. 

The SSOR method is therefore convergent for symmetric and positive 
definite matrices K. Even so. this method has been found to have a 
slower convergence rate than the SOR method for 2-Colored matrices 
Therefore, our interest in this method is as a preconditioner for a paral- 
lel conjugate gradient method, as will be described in Chapter 6. and 
not as a stand alone linear stationary method. However, even for our 
purposes, a parallel Implementation of this method is necessary. 

5.3-2. Parallel SSOR Implementation 

To solve (5.27) on a vector computer or a parallel array the equa- 
tions are first ordered so that K is a p-Colored matrix with colors C 1 . 
Cg and C p . Then the Multi-color SOR method is first applied to 
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K in a forward fashion, starting with the updating of color C^. followed 
by C 2 . until the equations of color C p are updated. Next, the Multi- 
color SOR method is applied to X in a reverse fashion starting with 
color C . followed by C p _-| until the equations of color C 1 are updated. 
After the reverse SOR pass Is completed, and hence one SSOR iteration, 
if the convergence test is met, the Iteration stops, otherwise, the process 
is repeated. For parallel arrays, after the values of each color C. are 
updated on both the forward and reverse pass they must be communi- 
cated between neighbor processors. The Multi-color SSOR algorithm is 
given below: 

For k=l ,2 k do 

max 

(1) For c=1.2 nc do 

(1) Solve for ^ +1 

c.p 

k +1 

(2) Send necessary portion of sl c p to logical neighbors 

k +1 

(3) Receive u. from logical neighbors. 

C #/7 

(2) For c=nc.nc-l 1 do 

(1) Solve for 

c .p 

k tl 

(2) Send necessary portion of u. c to logical neighbors. 

k +1 

(3) Receive u. from logical neighbors. 

o *n 

<3) If || u!^ + p - ul c p || ^ < e set the convergence flag. 

(4) If all processors have convergence flag set then stop. 

Algorithm 4 Multi-color SSOR 

Each iteration of the Multi-color SSOR method can be computation- 
ally expensive since it Is comprised of two Multi-color SOR Iterations 
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We now describe how to save 50% of this computational effort in the 
solution of (5.27) by using an auxilary storage vector. This observation 
is due to Conrad and Wailach [1979] Recall that the equations to be 
solved to carry out one SSOR Iteration (with cj= 1 and D-/ for simplicity) 
are 


a(/) 

U~L)iL +£, 


(I-U)sl =Ln +£. 


(5.28) 


The algorithm of Conrad and Wailach for doing multiple steps of SSOR 
is given below. 


a ( 0 ) 

( 1 ) Form Uul and store in £ . 

(This takes zero operations if the initial guess is zero. ) 


( 2 ) 


por ‘ = ’- 2 *max 


(3) Solve (l-L)u. = X+jfe as a forward Multi-color SOR pass. 

Store Lu. in £. 

(4) Solve U =£+& as a backward Multi-color pass. 

»0(+l) 

Store Uil in £. 


If a and 0 denote the number of nonzeroes in L and U respec- 
tively then (3) requires a multiplications and (4) requires 0 multiplica- 
tions. If v represents the maximum number of nonzeroes per row of K, 
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and m represents the number of multiplications per iteration of SSOR. 
then a+£< ( 17 -DA/ and 


m < ( 77 -DA/ If w=1 

m < (T7+1)A/ if U9* 1 


(5.29) 


5.3.3. Comparison with Rowwise Ordering 

It is well known, see Young[1971], that the SSOR method applied to 
a 2-Colored matrix has optimum relaxation factor w=l. whereas, if the 
grid points are ordered rowwise from bottom to top. left to right, the 
SSOR method converges faster for some w #1. It is an Interesting 
question whether the same behavior will be seen for p-Colored matrices. 
We solved the plane stress problem of Chapter 2 (60 equations) with the 
R/B/G ordering of Figure 14 as well as the rowwise ordering. The 
results are in Table 4 

Iterations Iterations 


OJ 

Rowwise 

U 

R/B/G 

.90 

589 

.950 

762 

1.00 

530 

.990 

759 

1.20 

467 

.993 

759 

1.25 

463 

.994 

758 

1.30 

463 

.995 

758 

1.35 

469 

.997 

758 



.998 

758 



1.000 

758 



1.050 

761 



1.100 

772 



1.200 

815 


Table 4 . SSOR Results ( R/B/G and Rowwise Orderings ) 
Plane Stress Problem (60 equations) 


Table 4 shows that .994 <cj< 1.01 produces the best results for the 
R/B/G ordering for the SSOR method, whereas. w=l 25 gave optimal 
results for the rowwise ordering. This suggests that the SSOR iterative 
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method for p-Colored matrices has optimal relaxation factor w=l as is 
true for 2-Colored matrices. However, this conjecture has yet to be 
proved or verified experimentally with more examples of p-Colored 
matrices. We note that even if this were true, the Multi-color SSOR 
method, with w=1, can be implemented effectively on vector computers 
and parallel arrays whereas the rowwise ordering can not. In addition, 
using <j= 1 alleviates the need to estimate the value of w which may be 
a time consuming process since little theory exists to aid in this choice 
for matrices that are not 2-Colored. 

5.4. Parallel Block Iterative Methods 

In this section we consider the Implementation of block iterative 
methods on vector computers and parallel arrays. In Section 5.4.1 we 

describe the Implementation of the Block Jacobi iterative method. In 
Section 5.4 2 we discuss the difficulties In implementing the Block SOR 

method and in Section 5.4.3 we generalize the Multi-color orderings of 

Section 5.2 2 and the p-Colored matrices of Section 5 2 3 to Block 

Multi-color orderings and p-Block Colored matrices. Lastly, we compare 
the p-Block Colored matrices to the rr-consistently ordered (7T-CO) 
matrices of Young[1971] and the p-cycllc matrices of Varga(1962] 

5.4.1. The Block Jacobi Method 

Let K be a pxp block matrix as shown In (5.24) and let the vectors 

IL and L be partitioned as Jig- ’ * ' >u -p )T and -1^, * * * .ip* 

respectively Furthermore, let 
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Then the. Block Jacobi method tor solving (5.1) Is 

0i/ +1 = (L+LDil+L (5 31) 



and 8 is called the Block Jacobi iteration matrix. The Block JOR 
method is Iteration (5.32) with B replaced by 8^ where 

8 = ujB + (1-w)/ (5 33) 

Now. the iteration (5.31) can be written in implementation form as 
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vT 1 - A - ' e’v? + <5 - 34> 

Note that if the are diagonal matrices. (5 34) is just the Jacobi 
iteration method (5.3), but if the K /; are not diagonal, p systems of 

equations must be solved each iteration, one for each ^,/ = 1.2 p. 

However, these systems completely uncouple and hence can be solved 
simultaneously on parallel architectures. 

On vector computers, the right hand side of (5.34) can be formed 

as matrix vector products and vector additions and the solution of the p 

systems of equations is vectorlzable (Buzbee.Boiey,Parter[1979J) with the 

vector length equal to p. On arrays with p processors, (5.34) Is easily 

implemented by assigning processor / to the calculation of u. t - Once 
k +1 

Is calculated, the appropriate components are sent to neighbor 

U -LI 

processors and the appropriate components of u. are received from 

neighbors for use In the next Iteration The p processors then complete 
the calculation of one Iteration in the time it takes the processor with 
the most unknowns to complete its calculation. If each processor has 

the same speed and the same number of unknowns. O(p) speedup can 
be achieved with this approach 

5.4.2. The Block SOR Method 

The Block SOR method for solving (5.1) is 

1 k+l 1 k 

= -kwU+n-(j)DUi. + L (5.35) 

W U) 

or 

fc + 1 , k , 

H = LJL + jC. (5 36) 


where 
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L = (D-wL)‘ 1 [wU+(l-w)0] 

W -1 
c =utD-uL) L 

and L Is the Block SOR Iteration matrix. 

(jj 


The Implementation form of (5.35) Is given by 

V.*' = V^vT * 

k+1 

and u ,/=1.2 p is solved in sequence, first n . , followed 

P ' 

and finally n p . 


(5.37) 
by 


The algorithm given by (5.37) is sequential and can not be com- 
pletely vectorized or implemented on parallel arrays. However, it Is well 
known that for some discretizations of partial differential equations a re- 
ordering of the grid points results in a block matrix for which the equa- 
tions in (5.37) uncouple. In particular, consider the grid point stencil of 
Figure 4. If we color the even rows of points Red and the odd rows of 
points Black as shown in Figure 21. 


° R 
° B 
° R 
° B 
° R 
° B 


° R ° R ° R 

° B ° B ° B 

° R ° R ° R 

° B ° B ° B 

° R ° R ° R 

° B ° B ° B 


Figure 21. Line Red/Black Ordering 


group all points in a given row into one block, and then number the 
red blocks first from bottom to top. followed by black blocks, the matrix 
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K for Figure 21 has the form 


K 





X 

X X 




(5.38) 


The SOR Iteration (5.37) Is the classical Red/Black line SOR which is 
composed of two block Jacobi sweeps, one on the Red blocks, followed 
by one for the Black blocks. The implementation of this method on 
vector computers is discussed by several authors 
(Buzbee.Boley.Partern9791.Nolen[1979].Parter and SteuerwalU1980], Saad 
and Sameh[1981]). For parallel arrays, every 2k rows of points are 
assigned to each processor as shown in Figure 22 for k = 1. 



Figure 22. Processor Assignment for Figure 21. 


For the assignment in Figure 22. processor i first updates the Red block 
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of unknowns, communicates these values to processor /+ 1. updates the 
Black block of unknowns and communicates these values to processor 
/-I and then checks the convergence of the process. This algorithm is 
executed in all processors with slight modifications in processor 1 and 
processor p. If the grid contains p rows, a speedup of 0(p/2) is 
achieved by this scheme. 

The same Red/Black line SOR method can be used for the linear 
triangular finite element discretization of Figure 13 and the bi-cubic rec- 
tangle of Figure 19. However, for the 9-point discretization in Figure 7, 
the 13-point discretization of Figure 16 and the quintic triangle in Figure 
20, a Red/Black 2-line SOR method can be used. For this scheme, we 
color the first bottom two rows Black, the next two rows Red. etc. as 
shown in Figure 23. 


° R 
° R 
° B 
° B 
° R 
° R 
° B 
° B 

Figure 23. 


o 


R * R ° R 

R ° R ° R 

B ° B ° B 

B ° B • B 

R ° R * R 

R ° R ° R 

B ° B ° B 

B • B ° B 

Red/Black 2-line Ordering 


and then group every two rows of Red points into one block and the 
same for the Black points If the Red blocks are numbered from bottom 
to top followed by the Black blocks, the resulting matrix K for Figure 23 
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will have the form (5.38) with 4 diagonal blocks instead of 6. The p 
rows of the problem grid are assigned to p/4 processors as shown in 
Figure 24 



Figure 24. Processor Assignment for Figure 23. 


With this assignment, a speedup of 0(p/4) is obtained 

It is well known that the K matrix assoicated with the Red/Black k- 
llne orderings is ^-consistently ordered and has the form (5.39) 


K = 




(5.39) 


where D r and 0^ represent the connectivity of the Red points to each 
other and the connectivity of the Black points to each other respectively 
(see Youngn971]) In this case, there is a theory for the selection of 
the relaxation parameter u for the associated Block SOR method which is 
briefly summarized below 
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Definition 9. (Young ) 

Let K be partitioned as in (5.24) and define a pxp matrix Z 
with elements z.j by 

Z li = 

Then K is 77-consistently ordered (.n-CO) if Z is consistently 
ordered. 

Theorem 4. (Young ) 

Let K be a positive definite n-CO matrix. Then 

(1) p(B ln) ) < 1 

l 

(2) J! = 2/n -n-p(s ( 7 r) ) 2 i 2 

D 

where is the Jacobi iteration matrix associated with the partitioning 

in (5.24). 

5.4.3. The Block Multi-Color SOR Method 

In the last section we showed how to implement either a 1 or a 2 
line SOR method on parallel arrays for all the discretizations in Section 
5 2.2. This algorithm has the advantages that a theory exists for deter- 
mining the optimum relaxation factor cj even though in practice the 
spectral radius of the Block Jacobi method may not be known in 
advance A major drawback of this implementation arises when the 
number of processors p greatly exceeds n/2 and n / 4 for the 1 and 2 
line methods respectively where n represents the number of rows in the 
problem grid In particular, these speedups are only n/2 and n/4, or 


0 if #r *0 

1 if kij+Q 
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equivalently. V77” /2 and VaT /4 when the number of unknowns is N-n^ . 
In this section, we propose an alternative blocking of the grid points that 
will give much better speedup results on a parallel array. 

5.4.3.I. Block Multi-Color Orderings 

As a first example, we consider the 9-polnt stencil of Figure 4. If 
we color the problem grid into Red/Black/White/Orange blocks as shown 
in Figure 25. 






Figure 25. 4-Block Coloring for Figure 4 

two blocks of the same color are not adjacent and hence the solution 
for blocks of unknowns of the same color in (5 37) completely uncouple. 
The blocks are assigned to processors in sizes 2kx4i so that each pro- 
cessor has the same number of blocks of each color as shown in Fig- 


ure 26. 
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Figure 26. Processor Assignment for Figure 25. 


The color pattern and processor assignment repeats beyond the subre- 

2 2 

glon shown. For this assignment with n grid points. p=n /8 and the 

2 

maximum speedup that can be achieved Is n /8. 

If the Red blocks In Figure 25 are numbered first, followed by the 
Black blocks, then the Orange and finally the White blocks, the matrix K 
will have the form 


'll 

*12 

X 13 

X 14 

21 

CM 

CM 

Q 

CO 

OJ 

X 

X 24 

31 

X 32 

°33 

CO 

X 

41 

X 42 

X 43 

X 44 


(5 40) 


- where the matrix D Is a ncxnc. block diagonal matrix of the form 
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D 


II 



(5.41) 


and nc. is the number of blocks of color / and D. . represents the 
i i .1 

connectivity of nodes of the / th block of color I to each other. Note 
that nodes In two distinct blocks of the same color are not connected; 
whereas, nodes in the same block may be connected. Matrices which 
have the form (5.40) and (5.41) will be called p-Block Colored matrices. 


As a second example, consider the 9-point discretization of Figure 
7. The points are colored into Red/Black/Green blocks as shown in 
Figure 27. 





Figure 27 . 3-Block Coloring for Figure 7 . 


The processors are assigned in blocks of size 3kx2l as shown in Figure 

2 

28 for fc= l and /=! and a speedup of 0 in /6) is expected. Note that 
only four local communication links are used lor each processor: 
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whereas for the coloring and assignment of Figure 8 and 9 respectively, 
four links plus four more for the next North, next South, next East, and 
next West processors was required to implement the point R/B/Q SOR 
method. 



Figure 28. Processor Assignment for Figure 27. 


As a last example of finite difference discretizations, consider the 13- 
point discretization of Figure 10. The points are colored into blocks with 
six colors. Red/Black/White/Orange/Purple/Yellow as shown in Figure 29 
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° W ° W 

° 0 ° 0 

° B ° B 

° G ° G 

Op Op 

° R ° R 


o p 

° R 
° W 
° O 
° B 
° G 


° P 
° R 
° W 
° o 
° B 
° G 


° B ° B 

° G * G 

o p o p 

° R ° R 

° W ° W 

° o ° o 


° w ° w 

° 0 ° 0 

° B ° B 

° G ° G 

Op Op 

° R ° R 


Op Op 

° R ° R 

° W ° W 

° o ° o 

° B ° B 

° G * G 


° B 0 B 

° G ° G 

op o p 

° R ° R 

° W ° W 

° O ° O 


Figure 29. 6-Block Coloring for Figure 10. 


The blocks are assigned to processors In sizes of 2kx6i as shown for 

2 

k = 1 and /= 1 In Figure 30 and a speedup of 0(n /12) is expected. The 
coloring and processor assignment repeats beyond the subregion shown. 



Figure 30. Processor Assignment for Figure 29. 


We now consider the block orderings and processor assignments for 
the finite element discretizations of Section 5.2.2. The linear triangular 
element discretization of Figure 13 can be colored into Red/Black/Green 
blocks as shown in Figure 27 with the associated processor assignment 
of Figure 28. The quadratic triangular element discretization of Figure 
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16 can be colored with six colors. Red/Black/Green/White/Orange/Purpie 
as shown In Figure 31. 
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Figure 31. 6-Block Coloring for Figure 16. 


The blocks are assigned to the processors in sizes 3kx4/ as shown in 

2 

Figure 32 for k = 1 and /= 1 and a speedup of 0(n /12) is expected. 
Note that only six local communication links for the interior processors 
are used for this Implementation. 



Figure 32. Processor Assignment for Figure 31. 
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All the examples of block colorings In this section lead to K p- 
Block Colored matrices. From (5.40) and (5.41) we can easily see that 
in general. p-Block Colored matrices are not tr-CO matrices. On the 
other hand, it is a trivial observation that rr-CO matrices are always per- 
mutatlonally similar to a 2-Block Colored matrix. 

It is also easy to conclude that p-Block Colored matrices are not. 
in general, p-cycllc (relative to the partitioning (5.24)) matrices of Varga. 
On the other hand. It is an Immediate generalization of Corollary 1 that 
p-cycllc matrices relative to (5.24) are permutatlonally similar to either a 
2-Block or 3-Block Colored matrix. 

We acknowledge that in general no theory exists as of yet to help 
In determining the relaxation factor w for p-Block Colored matrices when 
p>2. but the extra parallelism that can be obtained over a k— line SOR 
method may far outweigh this disadvantage. 



CHAPTER 6 


Parallel Conjugate Gradient Methods 


6.1. The Conjugate Gradient Method 

The conjugate gradient (CG) method was proposed in 1952 by 
Hestenes and StiefelCl 9521 as a method for solving a symmetric positive 
definite NxN system of linear equations. Although it is an iterative 
method in nature, it will converge in at most N steps in the absence of 
rounding error and hence may be viewed as a direct method. 

in practice, however, the method was found to take many more than 
N steps due to this rounding error and was not competitive with Gaus- 
sian elimination But In 1971. Reldtl 9711 showed that the method could 
sometimes be used effectively as an iterative procedure for large sparse 
systems since suitable convergence may occur in far fewer than N 
steps. Several derivations and descriptions of this procedure appear in 
the literature: see for example. Chandra[19781 who studied the method for 
both finite element and finite difference discretizations of elliptic partial 
differential equations, and Schultchen and Kostem[1973] who recommend 
the method for solving the linear systems that arise from finite element 
discretizations Schreiber[1983] discussed the implementation of the CG 
method for vector computers and Podsiadlo and Jordantl981] describe its 
implementation on the FEM. We give the algorithm below and review 
some of its implementation considerations on an array processor such as 
the FEM 
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(1) Choose u. 0 

( 2 ) £?=L-Ku° 

(3) XL°=£° 

(4) k =0 


(5) For fc=0,l ,..,A 

max 


(1) 

( 2 ) 

(3) 

(4) 


a=- 


, k k . 
(L .£ ) 

, k u k. 

(fi. .Kfi, ) 


k+1 k k 
ii =il +ap. 


// 


k+1 k 
iL ~ 1L 


k+1 k „ k 
L. =£. -aKu 



then stop , otherwise continue. 


(5) 

( 6 ) 


0 - 


, k +1 k + 1 . 
(L X 1 


, k k. 
<£. X > 


k+1 k + 1 „ k 

CL =X +00. 


Algorithm 1. Conjugate Gradient Algorithm 

In the above, (x.y) denotes the inner product jJl. 

This algorithm can be implemented on an array computer with p 

processors like the FEM by partitioning the K matrix by rows into p 

portions, where each portion consists of at most [— ] rows. The vectors 

P 

a.. L- £. and f_ are likewise partitioned by rows in the same manner 
The ith portion of each data structure is assigned to processor i as 
illustrated below for p=3. 
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Figure 1. Data Assignment to 3 Processors 

An examination of the CG algorithm as described above leads to the fol- 
lowing observations: 

(1) Once a is known, all processors can calculate their portion of 

k -t-1 

U. simultaneously with no communication required. 

(2) Once 0 is known, all processors can calculate their portion of 

B simultaneously with no communication required 

(3) Once Kb. and a are calculated, all processors can calculate 

k + 1 

their portion of £_ simultaneously with no communication 
required. 

(4) Some components of b residing in other processors will be 

needed for the calculation of Kb This means that the values of 

£ for the non-interior nodes must be communicated between 

u 

processors. This corresponds to the communication of the b 
values during a Jacobi or Multi-color SOB iteration as described 
In Chapter 5. 

(5) The calculations of a and 0 require inner products to be formed 
globally over the array computer. Each processor can calculate 
the partial sum that corresponds to its portion of rows, but these 
partial sums must then be added together. if each processor 
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were to broadcast its partial sum to every other processor, the 
number of values received by a single procesor is O(p-l) for 
one Inner product alone. 

This aspect of the CG algorithm was realized by Jordan[1979] to 
be detrimental to the performance of CG on an array computer 
and as a result the sum/max hardware circuit discussed in 
Chapter 3 was designed for the FEM to perform sums over the p 
processors. With this hardware, one inner product can be per- 
formed in 0(log 2 p) operations since each processor will load its 
partial sum onto the circuit, and the circuit will perform the sum 
In a binary tree fashion and then return the complete sum to 
each processor. 


6.2. Preconditioned Conjugate Gradient Methods 


6.2.1. The PCG Algorithm 


The condition number of any nonsingular matrix K with respect to a 
given norm Is 

<oo= ||k|| ||k -1 || (6d 


In particular. If K is symmetric with eigenvalues X ; then in the spectral 
(l.e. /g) norm 


< 00 = 


m “N 

min |Xj j 


( 6 . 2 ) 


The standard analysis of the conjugate gradient method. Chan- 


dra(19781. shows that the error in the / th iterate is bounded by 



|| W || 2 < II 2 


m 


(6.3) 

where a= ^o 

This bound shows that the error is a decreasing function of the condi- 
tion number of K. Hence, the conjugate gradient method applied to a 
system Kul=L where k(K)<k(K) will converge in fewer steps than the 
conjugate gradient method applied to Kn=L- This observation is the 
motivation for the preconditioned conjugate gradient method Instead of 
solving Ku.-L. we choose to solve 

A A 

Kii=L (6.4) 


where 

A -I -T 

K=Q 'KQ 

A 7 

ii=Q u 

A -I 

1=0 1 

and 0 is a nonsingular matrix chosen so that k(K)<k(K) Since 0 is 
nonsingular, we can define 

M=0Q 7 (6.5) 


and M will be symmetric and positive definite. In terms of M . K can 
be written as 

K=Q T M -\q~ T (6.6) 

A -1 

from which it can be seen that the eigenvalues of K and M K are the 
same. The introduction of M into the expression for K allows the stan- 
dard conjugate gradient algorithm to be written for the solution of tj_ 
directly in terms of M without explicitly forming Q This algorithm is 
described in Chandra[1978] and is given below 
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( 1 ) Choose u. 


(2) £° -L-K!L 


(3) ML=JL 


o o 

(4) a =L 


(5) lc=*0 


(6) For k=0. 


max 


.'k k. 

{£_ x. > 

(1) a= \ ^ -r- 

fc+1 k ^ k 

(2) SL =LL +aa 


II ^ | 

(3) II [ I it. - a I ^<6 then stop, otherwise continue. 


... k +1 k k 
( 4 ) jr =L -aKp. 


( 5 ) Ml =L 


(6) >3= 


.'k +1 fc +1 . 
(r ) 

* . 

<JL -L ) 


fc + l Atl - fc 
( 7 ) a =l + 00. 

Algorithm 2. Preconditioned Conjugate Gradient Algorithm 


The only difference in the implementation of Algorithm 2 and Algo- 
rithm 1 is the solution of a system of the form Mr_=x during each itera- 
tion. The considerations in choosing an M and in implementing the 
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corresponding system on a parallel computer are discussed in the next 
section 

6.2.2. Implementation of Preconditioners 

The preconditioned conjugate gradient algorithm of the last section 
requires a symmetric and positive definite preconditioning matrix M to be 
specified or computed. The question arises as how to choose M so 

* T -T -T 

that the condition number of K=<? M KQ 

max 


A -1 

where X. are the eigenvalues of K. or equivalently M K, is as small as 
possible. 

The best choice for M in the sense of minimizing k(K) is M=K but 
this gains nothing since k£=jl is just as difficult to solve as Ksi=L The 
approach that has been taken in the literature is to choose M to be a 
symmetric and positive definite approximation of K. if we write K as 

K=M-R (6.7) 

then 

(6 8) 

where R can be regarded as a remainder term Concus. Golub, and 
0'Leary[1976] give the following three criteria for M to be an effective 
preconditioner 

(1) Ml=£. is easily solved 

(2) M has small or nearly equal eigenvalues, or 
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(3) M -1 fl has small rank 

A fourth criteria that is a major consideration on a parallel computer is 

(4) M is easily formed. 


One class of preconditioners. Incomplete Cholesky Conjugate Gra- 
dient, (ICCG), (see Manteuffel[1979] for example) chooses M to be an 
incomplete Cholesky factorization of the matrix K. That is, M=LL T where 

K = LL T - R (6.9) 

and L is a lower triangular matrix and R is the remainder term. The 
matrix L in (6 9) and hence the matrix R will vary as different rules are 
used to create the Incomplete factorization. For example, one rule may 
restrict L to have the same sparsity structure as the lower part of K , 
whereas, another rule may allow fill— In within the band in some special 
fashion. In either case the system m£=£. will be solved by forward and 
backward substitutions on the triangular systems 


Li=r 


L 


T* 

L=iL 


( 6 . 10 ) 


respectively. 

The formation of M and the solution of the systems in (6 10) can 
be easily implemented on a sequential computer; however, an efficient 
parallel implementation on an array or vector machine may be difficult to 
devise. In particular, the formation of M as an incomplete Cholesky 
factorization may be difficult to implement in parallel In addition, if L 
does not have a special structure, the forward and backward substitutions 
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will be inherently sequential processes although Sameh and Kuck[l978] 
and van der Vorst[1981] have discussed the parallel solution of triangular 

systems and we address this Issue In more depth later in this section. 

However, in general, tridiagonal and banded matrices are not well suited 
for preconditioning matrices for a conjugate gradient method to be 
implemented on parallel computers. 

Another class of preconditioners that appears to be more easily 
implemented on parallel computers arises by choosing M to be a split- 
ting of K that describes a linear stationary iterative method. As a first 
example, let D be the diagonal or (block diagonal) of K and choose 
M=D. We note that In most cases M=0 will not closely approximate K . 

Furthermore, the choice M-D corresponds to a diagonal (block diagonal) 

scaling of K. That is. 

K=D~ 1/2 KO' 1/2 

A "“1/2 

Sl=0 n (6H1 

A — 1 10 

1=0 1 

and in practice this scaling would be done a priori and Ml-L would not 
be solved on each iteration. That is, the standard conjugate gradient 
method would be applied to (6 11) each iteration. 

As a second example of a preconditioner that arises from a splitting 
for an iterative method, consider the SSOR splitting of k£=£ which is 

2 

-^-K^-O-DO" 1 (^D-U) l£ = J [ (1 ~ w) 0 -K1 -u>) (LtU )+<jLD~ ^ U l£t£.(6. 1 2) 

2-uj <jj (J 2 -(jJ id 

where 0. -L, and -U are the diagonal, strictly lower, and strictly upper 

*(o) 

parts of K respectively. If we choose l. =0 and take one step of the 
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A a(1) 

SSOR method applied to Kl s l- the resulting r. will be the exact solu- 
tion to the system w£=r where the matrix M is given by (6.13). 

M = J±-i±D-L)D^ ll-O-U) (6.13) 

2-u < j ) cv 


We now consider the parallel implementation of the solution of Ml=£. 
when M is given by (6.13). If the matrix K is ordered by the Multi- 
color ordering, then the solution to the triangular systems 


(-Id -L )*. 

w 



(6.14) 


1 T* 

(-D -L ) V = x 
u 

can be efficiently implemented on parallel computers as one Multi-color 


A AfQ) 

SSOR iteration applied to Kl=£. with initial guess l. =0. 


Systems like (6.14) can be solved as Multi-color SSOR implementa- 
tions even if -L does not have the same elements as K as long as the 
sparsity structure of (D-L) corresponds to some Multi-color ordering. 
We note that being able to solve these systems efficiently on an array 
computer would allow ICCG methods that require the factors of M to 
have the same sparsity structure as K or that correspond to some 
Multi-color scheme to be implemented on parallel computers provided an 
efficient algorithm could be found to do the incomplete factorization in 
parallel. 


We next show for Laplace's equation that the above implementation 
of the SSOR preconditioning matrix (with w=l.D=/) for a Multi-colored 
grid achieves more accuracy with less computation than an implementa- 
tion described by van der Vorst[1981] for a natural ordering of the grid. 
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Let l-E-F denote the lower triangular part of the matrix that results 
from a 5-star discretization of Laplace's equation where the grid is 
ordered by the natural ordering. Let the matrices -E and -F contain 
the first and second nonzero subdiagonals of the matrix K respectively. 
Then in its block trldlagonal form, the matrix K can be written as 



where the matrix F is partitioned into the nxn diagonal submatrices F f 
and the matrix E is partitioned into the nxn E j submatrices where E f is 
the lower triangular part of the symmetric tridiagonal matrix T which has 
been scaled to have unit diagonal. Recall, that the system of equations 
that must be solved each iteration to Implement the preconditioner is, 

U-E-F)(I-E-F) T £ = l (6.16) 

Now, van der Vorst suggests approximating the forward substitution 

(l-E-F)y. = i_ (6 17) 

or equivalently the partitioned systems 

by 

Lj = (/ +F . +e^ t . ) (£ 7 +F / y / _ 1 ) (6.18) 

where m terms of the series for U-E) 1 are taken and a similar 
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expression is found for the approximation to the back substitution 

T A 

U-E-F) x.=£. Therefore, his idea is to take enough terms to approxi- 
mate M given by (6.16) and at the same time produce a preconditioner 
for a natural ordering of the grid that is vectorizable without being cost 
prohibatlve. Simple operation counts show the following number of multi- 
plications are needed to implement this scheme. 

(1) n to calculate 


( 2 ) 


2n (m-l)-m (m + D+2 
2 

For m =2. this Is n 


for finding and /+£^+ ..+(£^) 7 . 

-2. For m =3. this is 2n-5. 


( 3 ) 2mn ~™ ~ m f 0r multiplying (/tE^t ..+E^) (Cy+F^,.,) 

For m =2. this is 2n-3. For m =3. this is 3n-6. 


The total number of multiplications for m=2 and m= 3 are given 
below in (6.19) 


7 N - 10 VW 

m =2 

10A/ - 19 ^/N~ 

m =3 


(6.19) 


Now. if the grid points are ordered by the R/B ordering, the matrix 
will have the form 



Note that the matrix E in (6 16) is now so that the van der Vorst 
scheme in (6.18) reduces to 
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*1 = V F A-i 

A J A 

A A(Q ) 

which is the R/B SSOR iteration on the equation K£_=l with x. The 
number of multiplications required for this Multi-color SSOR implementa- 
tion is found from (5.29) to be at most 4/V Hence, by ordering the 
grid in a R/B fashion. 0(3/V) and 0(6A/> multiplications can be saved 
over the van der Vorst 2 or 3-term implementation respectively for the 
natural ordering. In principle, the van der Vorst scheme is more gen- 
eral since it can be applied to block matrices K regardless of the ord- 
ering of the unknowns, but the more dense the matrices T jt the more 
expensive the scheme will be. We also note that the m-term approxi- 
mation to (/-E^) -1 in (6.18) is not necessary if the grid is ordered R/B 
(also true for Multi-colored grids) since E. =0 for all /. This means 
that the solution to (6 16) is exact for the R/B ordering, whereas, it is 
only approximate for the natural ordering whenever m<n + 1 In addition, 
even if an exact solution to (6.16) could be obtained with a small value 
of m. (say 2 or 3), the number of iterations of the PCG method with 
the resulting preconditoner would have to be 0(1 75) or 0(2.5) times less 
(for m- 2.3 respectively) than the number of iterations with the R/B PCG 
method to compensate for the Increase In the computational work. 

6.2.3. m-step PCG Methods 

6.2.3. 1. Description 

It was demonstrated in section 6.2.2 that taking one step of a linear 
stationary Iterative method such as Jacobi or SSOR applied to k£=x. with 
*/ 0 ) 

£_ =0 results in a preconditioner for the conjugate gradient method that 


for / =1.2 
for / =1.2 
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can be implemented on a vector or array computer. The question now 
arises whether It would be beneficial to take more than one step of a 
linear stationary Iterative method to produce a preconditioner M that 
more closely approximates K. 

We begin by deriving an expression for M. Let K=P-Q be a split- 
ting of K that is associated with the linear stationary iterative method 
with iteration matrix G=P Then the m-step Iterative method applied 

to k£=x_ is 

P(/tGt...tG m “ 1 )' 1 £ (m) = [P(/+G+...+G m-1 ) -1 -(P-O)l£ <0) tL (6.20) 

*/Q) 

By choosing i_ =0. (6.20) becomes 

P(/+G+...+G /n " 1 )“ 1 £ </n) = l (6 - 21) 

Hence, the preconditioning matrix is 

M = P(/+G+...tG m_1 ) _1 (6 22) 

Now. M must be symmetric and positive definite to be considered as a 
preconditioner for the conjugate gradient method. Before we establish 

the necessary and sufficient conditions for M to satisfy these criteria, we 
prove the following lemma 

Lemma T_. 

if A =8C Is a symmetric positive definite matrix. B is symmetric, 
and C has positive eigenvalues, then B is positive definite. 


Proof- 


Let C 


-1 


2L=\i. 


or equivalently. 



Multiply both sides by A 


to get 


1/2 


or 


<A- 1/2 8A- 1/2 )(A 1/2 x) = XA 1/2 X 


(6.24) 


The proof is now by contradiction. Assume that B has a non- 
positive eigenvalue. Then, since (6.24) is a congruency 
transformation of B. It follows that R has a nonpositive eigen- 
value (see Gantmacher(19593). But the spectrum of R is identi- 
cal to that of C’ 1 and by hypothesis can not have a nonposi- 
tive eigenvalue. Hence B is positive definite. 


The necessary and sufficient conditions for M to be positive definite are 
given In Theorem 1. 


Theorem T 

Let K=P-Q be a symmetric positive definite matrix and let P be 
a symmetric nonsingular matrix. Then 


(1) the matrix M of (6.22) is symmetric. 

(2) for m odd. M is positive definite if and only is P is posi- 
tive definite. 

(3) for m even. M is positive definite if and only if P+Q Is 
positive definite. 
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Proof . 

To prove symmetry, we write M -1 as 

AT 1 =P _1 +p" 1 qp~ 1 +p _1 qp -1 op _1 +„.+p" 1 qp" 1 q . . - p" 1 (6 25) 

m - 1 terms 

Now since P and K and hence Q are symmetric, each term in 
(6.25) Is symmetric. Thus M ^ and therefore M are symmetric. 

J_ l 2 1 

The matrix G=P can be expressed as G=K 2 (/-K 2 P _1 K 2 )K 2 . 
Since P 1 is symmetric with P. the eigenvalues of the 

l/o _1 I/O 

congruence transformation K P K are real. Hence, the 
eigenvalues of G are real. 


To prove (2), let m be odd. If g is any eigenvalue of G 
other than 1, the corresponding eigenvalue of 


R=a +G+...+G m ~ 1 ) 


is 


m 


1+g +...+g 


m-1 _ 1-g 
1 ~Q 


which Is positive since m Is odd. If g = l, the corresponding 
eigenvalue of R Is equal to m which is also positive. 

Now. since P=MR and M Is symmetric and R has positive 
eigenvalues. It follows from Lemma 1 that if P is positive defin- 
ite then M must also be positive definite. Conversely, M can 
be written as M=PR ^ . Since R ^ has positive eigenvalues and 
P Is symmetric, we conclude from Lemma 1 that if M is posi- 
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tive definite then P is also positive definite. 

Next, to prove (3) let m be even. It is sufficient to consider 
Af -1 since any conclusions about the definiteness of Af -1 will 
apply to Af. Since m is even. Af” 1 from (6.22) can be written 
as 

Af" 1 = p” 1 (P+PG+PG 2 +PG 3 +...+PG ,n " 1 )P” 1 
or 

Af” 1 =P -1 [(P+PG)+(P+PG)G 2 +(P+PG)G 4 +...+(P+PG)G m ” 2 )]P _1 

Now, since PG-Q. M 1 can be written as 

Af -1 =P _1 (P+0)(/+G 2 +G 4 +...+G m_2 )P _1 (6.26) 

Now. since P is nonsingular and symmetric. Af 1 is positive 
definite if and only if the symmetric matrix 

S=(P+0)(/+G 2 +G 4 +. .+G m “ 2 ) (6.27) 

is positive definite 

Assume P+O is positive definite. Since S is symmetric and the 
2 4 rt) “2 ”1 

matrix (Z+G^+G +...+G > has positive eigenvalues. S is 

positive definite by Lemma 1. 

Conversely, if S is positive definite, since P+O is symmetric 
and the series /+G 2 +G 4 +...+G m 2 has positive eigenvalues. P+O 
is positive definite by Lemma 1. 
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Dubois. Greenbaum. and RodriqueC1979] consider a truncated Neu- 
mann series for K 1 as a preconditioner This preconditioner Is 
equivalent to that of (6.22) if K=P-Q corresponds to a Jacobi splitting 
where P=dlag(K). but they do not consider more complicated splittings 
that result from iterative methods. Theorem 1 extends their main result. 
Under the hypothesis that K and P are both symmetric and positive 
definite matrices and p(GXl. they prove that M Is symmetric and posi- 
tive definite for all m Note that for odd m the condition that p(GXl 

is not needed and for even m. the matrix P is only required to be 
symmetric. The relationship between the condition p(GXl and the posi- 
tive definiteness of P+O is given later In this discussion in Theorem 2. 

Theorem 1 Is helpful In choosing a splitting of K that will produce 
an m-step preconditioner that is symmetric and positive definite For 
example, if the Jacobi splitting of K ( P=D and Q=D-K where D is the 
diagonal of K) were considered, part (3) of the theorem says that if m 
is even, P+0 must be positive definite. We know from the Jacobi Con- 
vergence Theorem (see. eg. Youngn97ll). 

Jacobi Convergence Theorem 

Let K -P-Q be a real, symmetric, and nonsingular matrix with 
positive diagonal elements Then the Jacobi method converges 
(p(GXl) if and only if both P-Q and P+O are positive definite. 

that P+O and hence M will be positive definite only if the Jacobi 
method is convergent. For the problems of interest to us, the Jacobi 
method is not guaranteed to be convergent since we only know that K 
will be symmetric and positive definite. Therefore, for these problems, 
only odd values of m will yield m-step Jacobi preconditioning matrices 
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that are guaranteed to be positive definite. 

For any splitting satisfying the hypothesis of Theorem 1, the question 
arises whether the same relationship exists between the positive definite- 
ness of P+Q and the convergence of the iterative method with iteration 
matrix P that is given in the Jacobi Convergence Theorem above for 
the Jacobi splitting. We answer this question with Theorem 2. 

Theorem 2. 

Let K=P-Q be a symmetric positive definite matrix and let P be 
symmetric and nonsinguiar. Then p(P ^QKl if and only if 
P+Q is positive definite. 


Proof: 


First, assume P+Q is positive definite. Since K is symmetric 
positive definite and P is nonsingular. K=P-Q is a p-regular 
splitting. Hence, from Ortega's p-regular splitting theorem. 
Ortega(1971], p(P _1 Q>< 1. 

We next note that G=P ^Q can be expressed as 

G=K -1/2 U -K 1 / 2 p“V / 2 )K 1/2 
1 

1/2 -1 ~2 

and the matrix K P K has real eigenvalues since K is sym- 
metric positive definite and P is symmetric. Hence, G has real 
eigenvalues. 

Now, assume that p(GXl. then (/-G) 1 exists and since G has 
real eigenvalues, it easily follows that the matrix H defined by 
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H=(/-G) -1 (/+G) (6 28) 


has real eigenvalues. But we 

know 

from a theorem by 

Young(1971] that H Is N-stable. 

Hence 

H has positive eigen- 

values. Now. we can write H as 



H=K _1 (P+O) 


(6.29) 

or equivalently. 



K=(P+0)H _1 


(6.30) 

Finally, since K is symmetric and 

positive 

definite and H 1 has 


positive eigenvalues and P+0 is symmetric, we conclude from 
Lemma 1 that P+Q is positive definite. 

Note that the requirement that P be symmetric is stated as a suffi- 
cient condition but not a necessary one. It remains to be proven 
whether or not the symmetry of P Is necessary for M to be symmetric 
for odd m>1 or for P+O and hence M to be symmetric and positive 
definite for even m However, in practice, it was observed that the 
number of Iterations for convergence of the m-step PCG method with a 
nonsymmetric matrix P was extremely more than the number required by 
the standard conjugate gradient method (m=0) in particular, we solved 
the 60x60 plane stress problem which has a symmetric and positive 
definite coefficient matrix K with an m-step R/B/G SOR preconditioner. 
From the SOR convergence theorem stated in Chapter 5. we know that 
p(GK1. but the SOR splitting matrix with w=1 for simplicity is P=D-L 
and Is not symmetric. The results are given in Table 1 
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m m - step R/B/G. SOR 

0 49 

1 200 + 

2 200 + 

3 200 + 

4 200 + 

Table 1. Number of m-step R/B/G SOR PCG Iterations 

60x60 Plane Stress Problem 

These results indicate that only symmetric splittings should be considered. 
In the next section, we include results of the m-step PCG method 
derived from the Jacobi and SSOR splittings which are both symmetric 

6.2. 3. 2. Analysis of the Condition Number 

In the last section, we gave conditions for M to be symmetric and 

positive definite and hence to be considered as a preconditioner for the 

conjugate gradient method In this section we determine if increasing m 

will m fact produce a better conditioned system. For this purpose, we 

now denote by M the matrix of (6.22). 
m 

As a first step towards answering this question, we derive an 

expression for k(K ) Recall from (6.6) that K is similar to M so 
r m m 

that is the same as the ratio of the largest to smallest eigen- 

values of M^K. An expression for M^K as a polynominal in G is 

M _1 K = (/+G+...tG ni " 1 )P _1 (P-O) (6.31) 

m 

or 

= l-G m 
m 

where G=P ^Q. 
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Since we wish to compare k(K ) to k(K ^,). we will assume that both 

r m m +1 

M m and M ^ are symmetric and positive definite. By Theorem 1. this 
implies that P and P+0 are positive definite and thus by Theorem 2. 
p(G)<l. Therefore, since the proof of Theorem 1 showed that the 
eigenvalues X ; of G are real, they can be ordered as 

-1<\.<\_<- • -<X <1 
12 n 

Furthermore, let 0 be the eigenvalue with the smallest absolute value 

Then the condition number of K is 

m 


k(K) 

m 



\^>Q or X.j <0 and m odd 
X^<0. \ n > |X^| ,m even 
X.j<0, |x n | > |x n | .m even 


(6.32) 


As can be seen from (6.32), the conditions for <(K ^ ,)<k(K ) depend 

upon the distribution of the eigenvalues X. of G. These conditions are 
given by Theorem 3 if X^>0, and by Theorem 4 if 

X.j<0 and \ n > | X 1 1 We note that (6.32) shows for both odd and 

even m that if X^O and jx^ > | X n | it is impossible to conclude 

if K(K m+1 )<ic(.K m ) without knowledge of the values of X-j. X n , and 0. 


Theorem 3. 

Let K=P-Q and P be symmetric and positive definite with 
p(GXl. Then if X^>0. K(f< m ) ,s a decreasing function for all 
m . 
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Proof - 


We must show that jcO< . )<k(K) By (6 32), 

m 1 1 m 


k(K) = 
m 


A m — *J 

(l-X 1 Xl+X 1 + X 1 +...+X 1 ) 

(1-X )(l + x„txft...+x" , “ 1 ) 
n n n n 


(6 33) 


Hence, we must show for X^>0 that 


l+X 1 +X^+...+x'J 1 ^ HX n tX^t.»tX^ 


l+X.j+X^+...+X^ 1 ltX n +X^+...+ X^ 1 


(6.34) 


This inequality Is true since X n >X.j and 


1 (x) = 


1+X-Ht 2 +...+x m 

l+x+x 2 t...+x m ~ 1 


(6.35) 


can easily be shown to be an increasing function of x for x>0. 


As an application of Theorem 3 consider the SSOR splitting of a 

symmetric and positive definite matrix. Recall from the basis conver- 

gence theorem for SSOR that was stated in Chapter 5, that if K is a 
symmetric matrix with positive diagonal elements, the SSOR method con- 
verges if and only if K is positive definite and 0 <cj< 2. Therefore, 

p (G ) < 1 for the SSOR splitting and from Young [1971] we know that all 

the eigenvalues of G are real and nonnegative. Hence X^>0. To 

satisfy the last hypothesis of Theorem 3 we prove that the matrix P for 
the SSOR splitting is symmetric and positive definite. 
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Lemma 2. 

Let K=P-Q be symmetric and positive definite. If P is the 
SSOR splitting matrix, then P and M m are symmetric and posi- 
tive definite. 


Proof . 

Now, 

P= J±4D~^ /2 <.1-D-U)] T ID~^ /2 <Xd~U)] 

2-<j ui cj 

where D=dlag(K). and -U is the strictly upper triangular part of 

- 1/2 1 

K. Since the matrix D ( -0-1/ ) is upper triangular with 

< jJ 

positive diagonal elements and hence nonsingular, it follows 

Immediately that P is symmetric and positive definite. Therefore 
by Theorems 1 and 2. M m is also symmetric and positive 
definite for all m 

The results of the m-step SSOR preconditioned conjugate gradient 
method on the 60x60 plane stress problem problem are given in Table 

2. the results on the 1536x1536 plane stress problem are given in Table 

3. and the results on a 768x768 matrix derived from the 5-star discreti- 
zation of Laplace's equation are given in Table 4 For all three prob- 
lems, the results are given for both the natural rowwise ordering and 

Multi-color ordering of the grid. The convergence criterion was 

|| u. K+1 -ii <e. where €=10 -6 for all three problems. The standard 

conjugate gradient results with no preconditioning are indicated by m~ 0. 
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r/b/g 


Natural 

m 

# Iterations 


# Iterations 

# Iterations 


(w=1 ) 


(W=1 ) 

(w=1 2) 

0 

49 


49 

49 

1 

23 


20 

20 

2 

16 


15 

14 

3 

14 


12 

12 

4 

12 


11 

10 

Table 2 

ra-step SSOR PCG 

for 

60x60 Plane Stress Problem 


r/b/g 


Natural 

m 

# Iterations 

# 

Iterations 

# Iterations 


(CJ=1 ) 


(CJ=1 ) 

(w=1.6) 

0 

363 


363 

363 

1 

139 


111 

93 

2 

99 


80 

66 

3 

82 


65 

54 

4 

71 


57 

47 

Table 3. 

ra-step SSOR PCG for 1536x1536 Plane 

Stress Problem 


R/B 


Natural 

m 

# Iterations 


# Iterations 

# Iterations 


(<J=1 ) 


(<J=1) 

(W=1.8) 

0 

56 


56 

56 

1 

30 


28 

17 

2 

22 


21 

13 

3 

18 


17 

10 

4 

16 


15 - - 

9 


Table 4. m-step SSOR PCG for 768x768 Laplace's Equation 


The results in Tables 2.3. and 4 show that the number of iterations 


is a decreasing function of m as was predicted by Theorem 3. The 


results also indicate that there will be an optimal value of m. say m opt - 

since for m>m the reduction in the number of CG iterations is not 
opt 

enough to balance the increase in the number of iterations of the SSOR 


preconditioner For example, consider the R/B/G results in Table 3. 


The number of CG iterations and the number of steps of the SSOR 
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preconditioner as a function of m are summarized in the table below. 
The last two columns of the table give the total algorithm cost (in units 
of SSOR iterations) for the assumptions that one CG iteration is 
equivalent to one SSOR Iteration and that one CG iteration is twice as 
expensive as one SSOR iteration respectively. 


Iterations Total Cost 


m 

CG 

SSOR 

CG=SSOR 

CG»2(SS0R) 

0 

363 

0 

363 

726 

1 

139 

139 

278 

417 

2 

99 

198 

297 

396 

3 

82 

246 

328 

410 

4 

71 

284 

355 

426 


For this example. m=1 is optimal if one CG iteration costs the same as 
one SSOR iteration and m= 2 is optimal if one CG iteration is twice as 
costly as one SSOR iteration. The actual relative cost of the CG and 
SSOR iterations on a parallel computer will be a factor of the amount of 
artithmetlc and communication operations in each algorithm as well as 
the times to perform these operations on the machine. These issues 
will be discussed in more detail in Chapter 7 


We now prove Theorem 4. 

Theorem 4 

Let K -P -Q and P be symmetric and positive definite with 
p(G)<1. Then If \ n > |X.,j and X^O. 


m +1 


) < KtK). 
m 


(1) for m odd. k(.K 
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(2) for m even. .,) < <(K ) if and only if 

m +1 m 1 

(1+ jx^ m+1 )(l-X™) < (1-5 m )(l-X™ +1 ) 

Proof: 


By (6.32). we must show that 


1-0 

1-X 


m +1 
m +1 


1 + 


m 


1-X 


m 


(6.36) 


Since 

i-o m+1 


X n < 
< It 


1 and m+1 is even. (6.36) is true because 
i % i rn . . . m ^ 1 y . y m 

mi and '-k > 


Statement (2) of the theorem follows from (6.32) directly since 


k(K ..) < k(K ) can be written as 
m 1 1 m 


It 


mtl 


1-0 


m 


1-X 


m tl 


1-X 


m 


(6.37) 


, , , m tl . , . m . i, i mtl . , _m 
and 1-X„ > 1-X„ and It X, > 1-5 . 

n n III 


Observe from Theorem 4 that if X^> |x^| a better conditioned system 
will result by Increasing m from m (odd) to mtl (even), whereas, this 
may not be the case if m is increased from m (even) to mtl (odd). 

As an example of the application of Theorem 4. we consider the 
Jacobi splitting of any symmetric and positive definite matrix K that has 
Property A (see Young [1971]). For this splitting, P-D where D Is the 
diagonal of K and therefore P Is symmetric and positive definite. Now. 
since K has Property A, the eigenvalues of G occur in tx ; pairs and 
X n =-X^ and 6=0 From (2) of the theorem, we conclude that going 
from an even to a consecutive odd number of steps is advantageous if 
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and only If 

.,.,m+l w , . m. . ,, . m+1. 

(1+X n )(1-X n ) < <1-X ff ) (6.38) 

or equivalently, 

C +1 ~ 2X /i + 1 > 0 (6.39) 

As m increases the inequality in (6.39) reduces asymptotically to 

X n < j (6.40) 

For m =2 and m- 3. the exact conditions are X n <.62 and X fl <.53 respec- 
tively. But for problems of interest to us. \ n will be closer to 1 and 

we can conclude that it is not advantageous to Increase m from m 
(even) to m+1 (odd). This fact has been verified by numerical experi- 
ments for the m-step Jacobi preconditioner on an 89x89 symmetric and 

positive definite system that had Property A. The results are given in 

Table 5. 


rn 

0 

1 

2 

3 

4 

5 

6 

7 

8 

Table 5. m-step Jacobi 


# iterations 

45 

45 

23 

36 

21 

30 

18 

26 

16 

Results for 89x89 Problem 


Note from Table 5 that Increasing m from 2 to 3. from 4 to 5. and 
from 6 to 7 also Increases the number of iterations from 23 to 36. from 
21 to 30. and from 18 to 26 respectively. On the other hand, observe 
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that increasing m from an odd to a consecutive even number always 
reduces the number of iterations. Oubois. Qreenbaum. Rodrtque[1979] 
reported similar results for Poisson's equation but they explained the 
results by assuming that the eigenvalues of were near 0.1. and 2. 

Hence, our explanation given by Theorem 4 is more general. 

Theorem 4 suggested that In certain instances it is better to take 

an even number of steps of the preconditioner. If this were done, the 

question would be to determine the conditions for k(K .„) < *c(K ). 

m +2 m 

These conditions are given in Theorem 5. Notice that the hypothesis of 
this theorem only requires P to be symmetric and nonsinguiar instead of 
positive definite. 


Theorem 5. 

Let K=P-Q be symmetric and positive 

symmetric and nonsingular. If pfGKl 

K(X‘ ,) < «(K ). 

m Tc. m 


definite and let P be 
and m is even then 


Proof: 


From (6.32) we must show that 


1-0 


m +2 


1-0 


m 


, ,mt 2 

1-X 

T-X i 

l-8 mt2 . 

l-o 1 

>-c 2 ' 

i-x 1 

1 


m 

1 

m 


m 

n 


for 


for 


(6.41) 


We rewrite (6.41) as 
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i- loT * 2 < llh 


m +2 


nor i- h r 


i-ioi m+2 < H^l 

^ i ■ i ^ .» u 


m+2 


m 


for X. > X„ 
1 n 


for X n > |X,| 


(6.42) 


H«r h>„i 

Since 0 < l x li and 0 < I^nl ' (6,42) ,s true smce 


f (x) = 


1-x 


m +2 


1-x 


m 


(6 43) 


is an increasing function of x for x>0. 


Hence, if we always take an even number of steps of the preconditioner, 
a better conditioned system will result as the number of steps increases. 

So far we have only addressed the question of whether a better 
conditioned system results by increasing m We now turn to the ques- 
tion of how much improvement over m=l can be made by taking m> 1 
steps of the preconditioner Dubois. Greenbaum. and Rodriquell979] 
prove that the m-step PCG method can only reduce the number of 
iterations needed by the 1-step PCG method by a factor of m. that is. 


# iterations 1-sfep PCG 

# iterations m -step PCG 


(6.44) 


In practice, this theoretical bound may not be reached and lor a given 
distribution of eigenvalues it may be sharper for some values of m than 
for others. The results of Dubois, et.al. [19791 show this for the m-step 
Jacobi PCG for Laplace's equation Tables 2. 3. and 4 show for the 
m-step SSOR PCG method applied to both the plane stress problem and 
Laplace's equation that the bound is best for m= 2 Table 5 shows that 
for the m-step Jacobi PCG applied to a problem with Property A that 
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the bound Is extremely sharp for m - 2 and extremely poor for odd values 
of m . 

In order to determine the conditions under which the m-step PCG 
method gives the most improvement over the 1-step PCG method, we 

>c (K-j ) 

examine the ratio for both odd and even m with different 

kCK) 

m 

assumptions about the distribution of the eigenvalues X^ of G which are 
assumed to be ordered as -1<X 1 <X 2 < — w,th 6=min |Xj|. This 

ratio can easily be calculated from the equations of (6.32) and is sum- 
marized below for the various cases. 



„ , , 2 , m-1 

1+X„+X +...+ X 
n n n 


1 + X ^tX^+...+X^ 


"♦h 

. x 2 .m-1) 

)(1+X +X +...+ x_ 
i n n n 



-♦hr 

k(K 1 ) 

"♦hi 


r 

1 

S 

>•* 

* 

"♦hi 

"'"♦N> 


"♦hi 

i . , ,2 . m — 1 , 

)(1+X +X_+...+ X_ ) 
n n n 



(1- |o| m > 


"♦hi 

i>"-hl ro > 



( H1- |0| m ) 


X.j>0 


X 1 <0.X >0. m odd 
l n 


X^<0.X n <0. m odd (6 45) 


X 1 <0.X n > |x.j| m even 

X.j <0. |x^| > |X I m oven 


Several observations can be made from (6.45) and are listed below. 


(1) If X^O. the 


maximum value 


of 


k(K 1 ) 


k(K ) 
m 


occurs 


as 
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X?— >0 and X — *1 and is equal to m. 
1 n 


< (K 1 ) 

(2) If X, <0 and X >0. and m is odd, the maximum value of 

n «(K m > 

m 

1 + | X 1 1 

occurs when X- — >1 and is equal to m( •). 

t* , . u i m 


(3) The m-step PCG method (m> 1) is more effective if X >0. 

n 


(4) If X, <0. and X > X, . and m is even, the maximum value of 
1 n i 

k(K i ) 

— - — occurs when X- — >1 and | X ^ | = |X n | and is equal to 

kCI^ ) 
m 

2/tt 

— . Note that the larger 5, the larger this ratio will be. 

1-0 

Hence to achieve the maximum performance in this case, we 

would like the value of 5 to be as close to that of X 1 as 

possible. For K matrices with Property A. this is not possible 

since 0=0 and the maximum ratio of the two condition numbers 

is 2m . 


in summary, the m-step PCG method gives more improvement 
over the 1-step PCG method when an even number of steps of 
the preconditioner are taken and the eigenvalues of the matrix 
G are distributed as described in (4) above. 
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6.2.4. m-step Extrapolated PCG Methods 


6.2.4.I. Description 

It was pointed out in the last section that the m-step methods per- 
form better If the smallest eigenvalue X-j of G is negative and the larg- 
est eigenvalue \ p is positive with X n = j X ^ J . Furthermore, the maximum 
x(K 1 ) 

value of — - — was seen to be 2m if 0=0 and greater than 2m other- 

<(K) 

m 

wise. The purpose of this section is to demonstrate how to achieve this 
distribution of eigenvalues by using extrapolation. 

We begin by recalling that the iteration matrix H for an extrapolated 
iterative method can be written as 

H = (1-7)/ + 7 G (6.46) 

where G is the associated iteration matrix for 7=1 

This iterative method, corresponds to the splitting of K given by 

K = —P - (^2f» + Q) (6 47) 

7 y 

where 

K = P - Q (6 48) 

is the splitting of K that leads to the unextrapolated (7=1) method with 
iteration matrix G-P ^Q. 

If we define 

R = 1-P (6.49) 

7 


and 



s 
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1-7, 


t Q 


(6.50) 


the preconditioning matrix M for the extrapolated m-step PCG method 

*77 »/ 

is 


M = R (/ +H -1 ) _1 (6.51) 

m ,y 

The following Corollary gives the necessary and sufficient conditions for 
M to be symmetric and positive definite. 

#77 i y 


Corollary T 

Let K =P-Q be symmetric and positive definite and let P be 
symmetric and nonsingular. If y>0. then 

(1) is symmetric. 

(2) for odd m. is positive definite if and only if P is 

positive definite. 


Proof: 


(3) for even m, M is positive definite if and onlv if a <i 

m .y n 

2 

and 7 < y— — where the eigenvalues of G are g. <g n < ..<a . 
i y-j 1 s 2 n 


R ~7jP is symmetric since P is symmetric. It follows from 

Theorem 1. that M Is symmetric. 
m ,y 


Since 7>0. (2) follows from Theorem 1 since for odd m . is 
positive definite if and only if P is positive definite. 


To prove (3) we note from Theorem 1 that for even m M 

m ,y 

is positive definite if and only if flts is positive definite Since 
R is symmetric and nonsingular, we know by Theorem 3 that 
fl tS is positive definite if and only if p(H)< l. Therefore. 
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M is positive definite if and only If p((l- 7 )/+ 7 G)<l and this 
m ,y 

condition Is met if and only If l-7+7g n <l and -l<l- 7 + 7 g-| or 

2 

equivalently g <1 and 7 <-; and (3) follows. 

n l-g. 


The ratio of the largest to smallest eigenvalues of M 1 K where 

m ,7 

M - 1 K = l~H m (6.52) 

m. 7 

depends upon the distribution of the eigenvalues of H as was discussed 
in the last section and this distribution will be a function of y. How- 
ever. for the special case m=l, 

M~] y K =7 </-G> 

the ratio of the largest to smallest eigenvalues of K is independent 
of 7 and extrapolation is not worthwhile in this case. 


6.2.4.2. Choosing the Extrapolation Factor 

We would like to choose y so that the eigenvalues of H. 
h 1 </) 2 < — <h n satisfy 

| /» 1 | = h n (6.53) 

in order to achieve the most Improvement over the 1-step extrapolated 
(or unextrapolated for m=l) method. Since h.= l- 7 + 7 g / , (6.53) leads to 
the following choice for y: 

y opt ~ 2 -g 1 -g n (6.54) 

Note that if g 1 = ~Q n - T wi* 1 equal 1 and the matrix- G already has the 
optimal distribution of eigenvalues and hence no extrapolation will be 
performed. This will be the case for the Jacobi splitting for Laplace's 
equation. 
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Extrapolation is therefore most useful if all the eigenvalues g^ of G 

k(K 1 ) 

are nonnegative and from (6.45) the maximum value of Is 

<(K) 

m 


k (K 1 ) 

k(K ) 
m 


2m 


i-g 


m 


m 


g. >0, m even , extrapolation 


(6.55) 


g^ >0. no extrapolation 


To illustrate how effective extrapolation can be for m> 1 we consider 

the m-step SSOR PCG method. The eigenvalues of the SSOR iteration 

matrix G are nonnegative and p(G)< 1 for symmetric and positive definite 

matrices K so the hypotheses for m even in Corollary 1 are met if we 
2 


take y< 


1_g l 


The plane stress problem and Laplace's equation were 


solved with the m-step extrapolated SSOR PCG method for the Multi-color 
orderings of the respective grids. The results are given in parenthesis 
in Tables 6 and 7 respectively. 


m 

R/B/G 

Natural 


Cd=l 

(jj— 1 

w=1.6 

0 

363 

363 

363 

1 

139 

111 

93 

2 

99 (72) 

80 

66 

3 

82 

65 

54 

4 

71 (59) 

57 

47 


Table 6. m-3tep SSOR (Extrapolated SSOR) 
1536x1536 Plane Stre33 Problem 
(7=1.95) 
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m 

R/B/G 

Natural 


W=1 

td=1 

W=1.8 

0 

56 

56 

56 

1 

30 

28 

17 

2 

22 (17) 

21 

13 

3 

18 

17 

10 

4 

16 (14) 

15 

9 


Table 7. m-step SSOR (Extrapolated SSOR) 
768x768 Laplace's Equation 
(7=1.7) 


Note that for both the plane stress problem and Laplace's equation, the 
extrapolated method for the Multi-colored grid required fewer iterations 
for convergence than the corresponding unextrapoiated method but still 
required more Iterations than the unextrapoiated method applied to the 
natural ordering of the grid with optimal relaxation factor. The ratio of 
the number of iterations for the 1-step method to the number of itera- 
tions for the m-step method is given In Table 8 for the plane stress 
problem and in Table 9 for Laplace's equation 


m Unextrapoiated 


Extrapolated Theoretical Maximum 


2 1.40 1.93 

4 1.96 2.36 


2.00 

4.00 


Table 8. Ratio of 1-step to m-step R/B/G SSOR PCG 
Plane Stress Problem 
(7=1.95) 


1.36 1.76 2.00 

1.88 2.14 4.00 

Table 9. Ratio of l-step to m-step R/B SSOR PCG 
Laplace's Equation 
(7=1.70) 


2 

4 
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Tables 8 and 9 show that for both problems, the extrapolated method 
with m= 2 gives results closer to the theoretical maximum than does m= 4. 

The implementation of the extrapolation method takes little extra 
computational effort each iteration but does require the storage of an 
auxilary vector of length equal to the number of unknowns. However, 
the major consideration in the use of extrapolation Is the determination 

of the extrapolation factor y. As is seen from (6.54). the optimal value 
of y depends on prior knowiege of the largest and smallest eigenvalues 
of G which may not be known in practice. 

6.2.4.3. Comparison to the PPCG Method 

Johnson. Micchelli. and Paul (1982J have suggested symmetrically 

scaling the matrix K to have unit diagonal and then taking m terms of 

a parametrized Neumann series for = (/-G) 1 as the value for M 1 . 

m 

They call the resulting method the PPCG(m-l) method to mean m-step 

Parametrized Preconditioned Conjugate Gradient Method. This 
corresponds to a preconditioning matrix that is a polynominal of degree 
m - 1 in G. 

M 1 = a /+a,G a 0 G 2 +..+a ,G m 1 (6.56) 

m o 1 2 m -1 

derived from the Jacobi splitting 

K = / - G (6.57) 

and the solution to M u f~r can be Implemented by taking m steps of 
the Jacobi iterative method applied to k£=£. with initial guess r (o) =j2.. 
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Since K.l. and hence 6 are symmetric, clearly M m is symmetric. 
Now. can be written as a polygominal in K. 

M~\=la /+a,(/-K)+a 0 (/-K) 2 +...+a .(/-K) m_1 ]K (6.58) 

and Johnson, et.al.. guarantee that M m will be positive definite by 
choosing the a^'s so that the eigenvalues. y(X). of and hence 

those of M m are positive on the interval IX-j.X^] that contains the 
eigenvalues of K. Hence, the idea of the parametrlzation Is to choose 
the cys so that the eigenvalues of are positive on IX ^ . X^] and 

are as close to 1 as possible in some sense such as the min-max or 
the least squares criteria. 

We now show how to generalize this idea for any splitting of the 
matrix K. If we let 

K = P - Q (6.59) 

and G=P ^0 then from (6.22), the Inverse of the m-step preconditioner 
is 

M -1 = (/+G+G 2 t...tG /n_1 )P _1 (6.60) 

m 

We parametrize this series as. 

M -1 = (a_+a, G +a«G 2 +...ta ,G m-1 )P -1 (6.61) 

m .at o l d m - 1 

and note from Theorem 1 that M m a will be symmetric if P is sym- 
metric since the a's do not affect the proof of symmetry. 

The expression for K is given by 
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M" 1 K=la +a,(/-P _1 K)+. +<* 1 </-P” 1 K) /n “ 1 JP“ 1 K 

m .a o 1 m-1 


(6.62) 


and is seen to be a polynominai in P ‘/< rather than in K as in (6.58) 


This means that the values of a f should be chosen so that the eigen- 
values y(X) of M ^ K are positive on the interval [X,.X I that contains 
m ,<x in 

the eigenvalues of P 1 K and are as close to 1 as possible in some 
sense such as the min-max or least squares criteria. 

We now consider the special case of m= 2 in equation (6.61) for the 
general splitting K=P-Q. The matrix a is then 


M 


2. a 



0 -1 
( — / + fl ) 

a l 


(6.63) 


and the 2-step extrapolated preconditioner matrix for the same splitting 
of K is seen by (6 51) to be 




+ G ) 


-1 


(6.64) 


It is known, see Chandra{19781. that the same iterates of the PCG 
method will be obtained with M and any positive constant multiple of M 
Hence, the extrapolated preconditioner of (6.64) yields the same results 
as the parametrized preconditioner of (6 63) If 


2-y _ a o 


(6.65) 


or equivalently. 


y = 


a /a,+l 
o 1 


(6 66) 
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We note that for m >2 such a relationship will not exist between 

M and M and more research will be required to determine if 
m .a m . y 

the parametrized preconditioner is better than extrapolation for these 
cases. 



CHAPTER 7 


Parallel Algorithm Analysis 


7.1. Execution Time Model _ 

The method of comparing algorithms for serial machines is the 
standard complexity analysis of the number of arithmetic operations 
required for completion of the algorithm. For iterative methods this can 
be broken Into the number of operations per iteration times the number 
of Iterations necessary for convergence. 

It has been pointed out repeatedly in the literature, see Ortega and 
Voigt{1977J 8uzbee(1978], Qrosch[1979|. Jonesn980J. Hockney[1982J. for 

examples, that this standard complexity analysis Is not sufficient to com- 
pare parallel algorithms. Additional factors such as data transmissions 
between processors, processor synchronizations, and global decision mak- 
ing among processors add to the execution time of a parallel algorithm. 
The number of these overhead operations vary with each algorithm and 
the number of processors used to solve the problem, in addition, the 

time required per operation may be a function of the number of proces- 
sors as well as the hardware/software implementation of the operation on 
the parallel machine. These considerations suggest that the analysis of 
a parallel algorithm's performance on a particular machine should Include 
a model for its execution time. 

In this chapter an execution time model Is developed for analyzing 

the parallel algorithms in Chapters 5 and 6. The number of arithmetic 

operations, data transmissions, synchronizations, and flag checks per 
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Iteration are multiplied by the total number of Iterations to yield the 

number of operations of each type. These numbers are then multiplied 
by the time cost for the respective operation to obtain the total execution 
time. This execution time Is measured In units of one 

multlplicatlon/addltlon pair. A detailed description of the model follows: 

Let 

a = number of multiplication/addition pairs per iteration 

b = number of barrier synchronizations per Iteration 

c = number of colors 

d = number of divisions per iteration 

e = number of equations of each color per processor 

f = number of global flag checks per Iteration 

g = number of global transmissions per Iteration 

/ = number of interior equations per processor 

m = number of steps of m-step PCG 

p » number of processors 

r = number of receives per Iteration 

s = number of sum/max circuitry uses per iteration 

t = number of local transmissions per iteration 

v = number of local convergence tests per Iteration 

t j ® maximum number of nonzero entries per row of K 

/ = number of iterations required for convergence 

N = number of equations to be distributed to p processors 

The following are the costs of each operation in units of 
one single precision floating point multiplication/addition 
pair. 
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B = cost per barrier synchronization 

D = cost per division 

F » cost per global flag check 

G = cost per global send 

R = cost per receive (both local and global) 

S = cost per sum/max usage 

T = cost per local transmission (send) 

V = cost per local convergence test 

The formula for the execution time GE) Is given by 

E = Ua+dDfvV+rRHT+gG+bB+fF+sS] (7.1) 

This formula can also be used to determine the execution time of a 
sequential algorithm by setting r.t.g.b.f, and s to zero. 

The values of v.b. and f are the same for all the iterative algo- 
rithms considered in Chapters 5 and 6. A description of how these 

values are determined will now be given. The value of ||^~ 
must be determined at the fcth iteration. If this value is less than a 
prescribed tolerance €. the Iteration terminates. If not. the next iteration 

Is begun. This estimate is determined in two steps. First, each pro- 

k k 

cessor compares Its portion of u. ■ say with the corresponding por- 
tion 1 obtained on iteration *-l. If ||i^“ if£ -1 <€ the proces- 

sor raises its convergence flag. For a sequential algorithm this conver- 
gence criterion requires N comparisons each Iteration; whereas, for a 
parallel algorithm, an equal partitioning of u to the p processors allows 
these comparisons to be performed simultaneously In the processors. 
Hence the value of v is given by 
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v = N (sequential ) 

v = N/p (parallel ) 


(7.2) 


We note that the complexity analysis of an algorithm on a sequential 
computer rarely includes the operation counts for the convergence test. 
However, if p=0(N). (7.2) implies that v=0(l) for a parallel implementa- 
tion. This can cause a significant reduction In execution time of the 
parallel algorithm if the convergence test Is a costly operation or if the 
number of Iterations is large. 

Secondly, the processors must be synchronized at the end of each 
Iteration for the purpose of checking the convergence flags of all the 
processors. That is. 


b = 1 
1 = 1 


(7.3) 


On the Finite Element Machine this synchronization is implemented as a 
barrier whereby each processor uses the signal flag hardware circuit to 
monitor the synchronization flag on all processors. When this flag Is set 
in all processors, the barrier is lowered and the processors continue 
with the next instruction which Is the global convergence test. To per- 
form this test, each processor uses the signal flag hardware to check 
the convergence flag In all processors. If all processors have set their 
flag, the algorithm terminates: If not. the next Iteration is begun. 

We note that other norms could be used to estimate the error. In 

k k - 1 T k k - 1 

particular, the 2-norm would require Qi ~IL ) (si -si ) to be formed. 
This would require N multiplications and N-1 additions (N additions if 
the sum Is initally set to zero) for a sequential algorithm For a paral- 
lel algorithm. N/p multiplications can be done simultaneously by the p 
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processors and these partial sums loaded onto the sum/max circuit. The 
circuit would calculate and then return the complete sum (inner product) 
to each processor. Since the sum/max circuit Is not yet operational on 
FEM and the actual programs run on FEM used the »-norm test, the 
2-norm will not be considered In the model. 

To determine / In (7.1). for the plane stress problem for example, 
let x and y denote the number of rows and columns of problem nodes 
assigned to each processor as shown in Figure 1 where lines between 
processors represent the local links that are used during computation. 



Figure 1. Problem Node Assignment 

Furthermore, let d represent the number of equations at each problem 
node, and assume that the grid is discretized by linear triangular finite 
elements so that each Interior node is on a common finite element with 
six other nodes (East. West. North. South. Northwest. Southeast). 
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The number of equations within each processor that correspond to 
the u. values that are not communicated to other processors during 
computation (Interior equations) is given by 


/ 


d (x-l)(y-l) 
d (x-2)(y-2) 


p <9 
p> 9 


(7.4) 


where we are assuming that the processors are connected in an array 
fashion like the FEM so that a completely interior processor does not 
occur until p- 9. 

The values of a.g.s.t and r In (7.1) depend on the particular Itera- 
tive algorithm used to solve the problem and will be discussed separately 
In the following sections for the Multi-Color SOR, the Conjugate Gradient, 
and the m-step (SSOR) Preconditioned Conjugate Gradient algorithms. 


7.1.1. Execution Time for Multi-Color SOR 

To solve Ku.-L by the Multi-Color SOR method given by Algorithm 3 
In Chapter 5, at most 7}-l multlplicatlons/additlon pairs for each of the 
N rows of K are required to produce the next iterate and 2 additional 
multlplicatlons/addltlons per row to do the over-relaxation. That is. 

a = A7 (7?+l) (7.5) 

If / represents the number of SOR Iterations, the total execution time for 
the sequential Multi-Color SOR algorithm Is given by 

E = /[/V(7?tl)+/VV] (7.6) 

Now suppose that p processors are available and that p evenly 
divides N. Then, the arithmetic In (7.5) Is divided by p to give 
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a = /V(7)+1)/p (7.7) 

Chapter 5 showed how to map problems on rectangular grids onto an 
array of processors so that the Multi-color SOR method only requires a 
given processor to communicate with at most eight of its nearest neigh- 
bors. For example, an interior processor will communicate with Its four 
nearest neighbors (North. South, East. West) during the solution of Laplace's 
equation if the region is discretized by the usual five-star discretization, 
and during the solution of the plane stress problem with the domain 
discretized by linear triangular finite elements, an interior processor will 

communicate with Its North, South, East. West. Northwest. Southeast, 
neighbors as shown In Figure 1. This means that the global bus and 
the sum/max circuitry are not required for these discretizations and 

g = 0 (7.8) 

s = 0 (7.9) 

The number of local sends for each processor will equal the number of 
non-interior equations: 

t = ce-i 


or equivalently. 


dOr+y-1) p <9 
d (2x +2y -4) p>9 

k 


(7.10) 


and the number of values received by each processor per iteration will 
be the number of non-interior equations for all six neighboring proces- 


sors: 


’2cf(xty+l) p> 9 

dOr+y+1) p <9 
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(7.11) 


Thus, the total parallel execution time for Red/Black/Green SOR for the 
plane stress problem Is 

E = / UN (77+l)+A/V)/p +fT +rR +8 +F] (7.12) 

where f. and r are given by (7.10) and (7.11) respectively. 


7.1.2. Execution Time for Conjugate Gradient 

To solve Kil=L by the conjugate gradient method given by Algorithm 

1 in Chapter 6. the following number of multiplications and additions are 

required: at most 7?A/ multiplications and 17 (A/ — 1) additions for forming 

Kb.. N multiplications and N - 1 additions for doing each of the Inner 

products b Kb and jljl. N multiplications and N additions for each of 
k k k k k+1 k 

the computations u. fap. . L -olKb . and i_ +0b respectively, in addl- 
ton. 2 divisions for the calculation of a and 0 are required. Hence, the 
total number of arithmetic operations per iteration is given by (7.13) 


a < t}N+ 5N 
d - 2 


(7.13) 


and the total execution time for the sequential conjugate gradient is 
bounded by 

E < / [17 A/ +5A/+2D +NV) (7.14) 


Now. suppose that p processors are available and that p evenly 
divides N. Then the number of multiplication/addition pairs in (7.13) is 
divided by p to give 
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a = (t}N +5N )/p 


(7.15) 


but the two divisions must be done by all p processors. 

The values of t.r.g. and s In (7.1) depend on the mechanism used 

7 7 

for doing the Inner products jljl and p Kp. We first give these values 
for the plane stress problem if a bus (such as the global bus on FEM) 
alone Is used. The values that each processor must communicate dur- 
ing each iteration of the conjugate gradient algorithm are the non- 
interior p values, one partial sum for Zl. and one partial sum for 
qJkq.. The non-interior p values are sent to the six neighboring pro- 
cessors for a total of ce-i transmissions as given by (7.10). In addi- 
tion. the partial sums for Zl and p T Kp are sent to the eight local 
neighbor processors and broadcast over the global bus to the remaining 
p-9 processors. The totals for t and g are given in (7.16) and (7.17) 
respectively. 


r = c e-i+2 


(7.16) 


g 



p <9 
p >9 


(7.17) 


The non-Interlor values of p from the six local neighbors must be 
received each Iteration. Also the two partial sums ZjL and p^Kp must 
be received from p-1 processors and added to the accumulating global 
sum. If q denotes the ratio of the time to receive and add one 
number to this global sum to the time to receive the number, the value 
of r Is given by (7.18). 


» 

= * 

m 


2d Gc+y+1) + 2(p -l)p 
dCr+y + 1) + 2 (p-l )<7 


p> 9 
p <9 


(7.18) 
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The total execution time for the parallel conjugate gradient algorithm that 
uses the global bus for Inner products Is given In (7.19). 

£ * I lti)N +5A/ +NV)/p +20 +tT+rR +gG +B +F] (7.19) 

where t. g. and r are given by (7.16), (7.17). and (7.18) respectively. 

Secondly, we assume that a special hardware circuit such as the 
sum/max circuit on the FEM Is available for performing the Inner pro- 

7 7 

ducts. Then the partial sums for £.£. and & Kq. are calculated simul- 
taneously in the processors and then loaded onto the sum/max circuit, 
summed by the circuit, and the result placed into a special receive 
buffer in each processor, in particular, the values of t. r. g. and s 
are: 


t = ce -/ 


(7.20) 

f2of (xty+1) 

p >9 


r = i 


(7.21) 

\ d (x+y+l) 

p <9 


o 

ii 


(7.22) 

s = 2 


(7.23) 


Note that the number of receives, r. is no longer 0(p) since the global 
bus is not used. The total execution time for the parallel conjugate 
gradient algorithm that uses this special hardware for the Inner products 
Is given in (7.24). 

£ = / KijAf +5W +A/V)/p +20 +fT +rfl t2S +8 +£] (7.24) 

where t and r are given by (7.20) and (7.21) respectively. 
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7.1.3. Execution Time for m-Step (SSOR) PCG 

The number of muitiplicatlon/addition pairs per iteration is equal to 

the number for standard conjugate gradient plus m times the number for 
one step of SSOR. Each step of SSOR can be implemented as dis- 

cussed in Chapter 5 and will require A/ ( 77 +D multiplications if either 
overrelaxation or extrapolation Is used. The total execution time for the 
sequential m-step SSOR PCG method is given in (7.25). 

E = 7?A/+5/V+20+A/V’+m/V(7?+l) (7.25) 

The addition of the m-step SSOR preconditioner to the standard 
conjugate gradient algorithm adds extra arithmetic and local send and 
receive operations. It is Important to note that the SSOR algorithm can 
be implemented as a forward and backward Multi-Color SOR method as 
described in Chapter 5 and hence no global communication between 
processors is required for rectangular grids. 

The number of multiplications, local sends, and receives required by 

the m-step preconditioner alone will now be described. The arithmetic 
in (7.25) due to SSOR Is distributed among p processors and is given 
by (7.26). 

a = m/V(7)+l)/p (7 26) 

The only communication that the m-step preconditioner adds to the PCG 

algorithm is the local communication of the r_ values during the m-step 

Iterative solution of )<£=£.. These values must be sent to the six neigh- 
bor processors twice for every step of the preconditioner, once for the 
forward SOR pass and once for the reverse pass as indicated by (7.27). 
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t = 2 m(ce-l) (7.27) 

Likewise, values of £ must be received from the six neighbor processors 
for both the forward and reverse SOR pass for every step of the 
preconditioner. Hence, the number of receives is given by 


*2cf(x+y+l)m p< 9 

- 

4d(x+y+l)m p> 9 

> 


(7.28) 


The total execution time of the parallel m-step PCG algorithm 
depends on the Implementation of the inner products. First. If the par- 
tial sums of the Inner products are communicated over the global bus. 
the execution time for the parallel m-step PCG method Is given by 

(7.29). 

E = C + / tm A/ (7J+1 Vp+fT+rfl] (7.29) 

where t and r are given by (7.27) and (7.28) respectively and C Is the 

execution time of the parallel conjugate gradient method given in (7.19). 

Secondly, we assume a special hardware circuit such as the 
sum/max circuit on FEM is used to perform the inner products. Then 

the execution time for the m-step SSOR PCG method is given by (7.29) 
where. In this case, C represents the execution time of the parallel 
conjugate gradient method given In (7.24). 

7.2. Model Validation 

To fully validate the model, problems would need to be solved on a 
p-processor array such as the Finite Element Machine so that an 
algorithm's execution time dependence on p could be determined. At 
this writing, only four processors are operational on the FEM: therefore. 
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the model can only be partially validated. For instance, algorithms that 
use only four processors will make use of the local communication links 
and the signal flag network but will not use the sum/max hardware or 
the global bus. 

The plane stress problem of Chapter 2 was chosen to validate the 
model on the 4-processor FEM. The plate was discretized into 50 tri- 
angular finite elements and the basis functions were chosen to be 
piecewise linear polynomials. As a result, the displacements were calcu- 
lated at the vertex nodes shown in Figure 2. 
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Figure 2. Four Processor Assignment 
Plane Stress Problem 


Furthermore, the plate Is constrained on the left edge so that the dis- 
placements at the six nodes along this edge are zero. The calculation 
of the displacements u and v at the remaining 30 nodes must be parti- 
tioned to the four processors. Processors 1 and 3 In Figure 2 are 
assigned 6 nodes, or 12 equations each; whereas, processors 2 and 4 
will each solve 18 equations. The speed of the 4-processor FEM Is 

then governed by processors 2 and 4 since more work is assigned to 
them and as a result we can consider each of the four processors to 
be solving 18 equations. Therefore, the best possible speedup for this 
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problem on the 4-processor machine Is 

||(4> = 3.33 (7.30) 

which corresponds to an efficiency of 83%. 

The parameters for the current 4-processor FEM that were used In 
the model were obtained from Tom Crockett at NASA and the arithmetic 
speeds were gotten from the specifications for the AMD 9512 floating 
point chlp[19791. 


1 mult/add pair 

- 844/13 



1 division 

- 472 ns 

or 

D-0.5592 

1 barrier 

- 185/tS 

or 

B-0.2192 

1 flag check 

- 156/15 

or 

F-0.1848 


In addition, the times to send and receive values on the local links were 
determined by Loendorf and Smlthn982] to be 

1 local receive - 1500/ts or R-1.7730 
1 local send - 1240 ) us or T=*1.4692 

and the time for a local convergence check was assumed to equal the 
time for a muitlplicatlon/additlon pair (V=l>. The speedups obtained on 
the 4-processor FEM for the 3-Color SOR and standard conjugate gra- 
dient algorithm as well as those predicted by the model are given in 


Table 1. 



162 


FEM Model FEM Model 


Method 

Speedup 

Speedup 

Efficiency 

Efficiency 

R/B/G SOR 

2.84 

2.93 

71% 

73.3% 

CG 

2.82 

2.90 

71% 

72.5% 


Table 1, 4- FEM and Model Results 


The efficiency of 71% In Table 1 Indicates that the parallel overhead due 
to communication and synchronization between processors was only 12% 
of the execution time since 17% efficiency was lost because the number 
of equations was not evenly divisible by the number of processors. We 
note that the efficiencies predicted by the model agree very closely with 
the 4-processor FEM results. 

7.3. Model Results 

The questions that we will answer with the model are Itemized 
below: 

(1) What ratio, a. of communication time to arithmetic time must 

the array computer have to efficiently support the implementation 
of an algorithm, and how does this ratio change as p 

Increases? 

(2) For a given algorithm and ratio a. what is the maximum 

number of processors that should be used to achieve a given 
efficiency level? 

(3) If the global bus Is used to do the Inner products, will the m- 
step PCQ(SSOR) be a more efficient algorithm than the standard 
conjugate gradient method (m= 0)? 


These questions can be answered by analyzing the speedup of 
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a p-processor algorithm over its corresponding single processor 
version and will be discussed In section 7.3.1. 

(4) How does an algorithm's performance on an array such as FEM 
with certain hardware speeds compare with Its performance on 
a benchmark hardware? A measure of this performance will be 
called the para-efflclency and will be discussed in section 7.3.2. 

(5) For a given a. what Is the best algorithm for solving the prob- 
lem as p Increases? 

(6) For a given algorithm, how does the execution time change as 
a changes? In particular, below what a level will the execution 
time fail to decrease significantly? 

(7) For a given a. if the global bus Is used for the inner products, 
for what values of m will the execution time of m-step 
PCG(SSOR) be less than that of standard conjugate gradient 
Cm= 0)? 

The answers to these questions are found by examining the 
execution times as a function of p and a and are given in 
section 7.3.3. 

(8) What is the tradeoff between a decrease in speed and an 
increase in the chance of machine failure as p Increases? 
This Issue of reliability Is discussed in section 7.3.4. 

(9) For a given algorithm and ratio a. how many processors are 
necessary to be competitive with a conventional machine and 
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problem solver? This question is the topic of section 7.3.5. 


The model was used to answer the above questions for the algo- 
rithms of Chapter 5 and 6 as applied to the following two test problems. 
This first problem is Laplace's equation on a rectangular domain with 
Dirlchlet boundary conditions. The domain is discretized Into 18 rows 
and 50 columns of nodes so that the values at the 16 by 48 grid of 
Interior nodes are to be found. The resulting stiffness matrix K has 

dimension 768 by 768. If the nodes are ordered by the classical 
Red/Black scheme, the problem can be solved on a machine consisting 
of the following number of processors which are assigned the 

corresponding grid sizes. 


Processors 


Grid size/processor 


1 

4 

16 

64 

128 

384 


16x48 
8X24 
4x12 
2X 6 
2x 3 
IX 2 


Table 2. Processor Assignments 
Laplace's Equation 


The second problem Is the plane stress problem of Chapter 2. A 
rectangular plate is discretized by linear triangular finite elements as 
shown in Figure 2 so that the displacements, u and v. at 16 rows and 
48 columns of nodes must be found. The resulting stiffness matrix K 
has dimension 1536 by 1536. These nodes are colored Red/Black/Green 
as described in Chapter 5. This problem can be solved on a machine 
with the following number of processors by assigning the corresponding 
grid of nodes to each processor. 
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Processor Grid size/processor 


1 

16X48 

4 

8x24 

16 

4x12 

64 

2X 6 

256 

IX 3 


Table 3. Processor Assignments 
Plane Stress Problem 


We note that the execution time per iteration of an algorithm is a func- 
tion of the number of equations per processor as well as the number of 
processors. If we fix the size of the problem, and allow the number of 
processors and hence the number of equations per processor to vary, 

we get results like those In the next five sections. Although not done 

here, the size of the problem. N. could be varied and the model used 
to predict the dependence on N for a fixed number of processors. 

However, for algorithms like Multi-color SOR which do not require any 
global communication or summation, the execution time is only a function 
of the number of equations per processor and remains constant as N 

increases. 


7.3.1. Speedup Results 

The speedup as a function of the number of processors. 

_ ... Execution timet 1) 

Speedup <p> = 

for the Multi-Color SOR. the standard conjugate gradient, and the m-step 
(SSOR) preconditioned conjugate gradient methods of Chapters 5 and 6 
can be predicted by the model for a Finite Element Machine with partic- 
ular arithmetic and hardware/software communication times. Since speedup 
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is a measure of how well the architecture of the machine can support 
the implementation of an algorithm, or conversely, how well an algorithm 
performs on a particular machine, these arithmetic and communication 
times can be viewed as design variables for the machine and by chang- 
ing these times we can determine what ratio of communication to arith- 
metic the machine must have to efficiently support a given algorithm. 

Since the barrier operation and the flag checks occur only once per 
iteration, the send and receive operations comprise the majority of the 
communication between processors. Therefore, to analyze the perfor- 
mance of an algorithm as the ratio of communication to arithmetic time 
changes, we choose to vary T. G, and R In (7.1) and let the values of 
S. F. S. and D remain constant. Although 7. G. and R can have dif- 
ferent values, we analyze the case where these values are equal and 
denote this value by a. We will refer to a as the ratio of communica- 
tion to arithmetic time. 

-2 -1 

The model was run with four values of a: namely. 10 . 10 . 1. 

and 10 in order to determine an algorithm's performance over a wide 
range of values and to aid In machine design. Once this is done, the 
results can be used to determine a smaller interval for additional model 
runs. Table 4 gives the four sets of model values where each set Is 
regarded as describing a particular Finite Element Machine. 
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FEM Machine 

T 

G 

R 

B 

F 

S 

D 

V 

1 

10.00 

10.00 

10.00 

1.85 

1.56 

* 

.5592 

1 

2 

1.00 

1.00 

1.00 

1.85 

1.56 

* 

.5592 

1 

3 

.10 

.10 

.10 

1.85 

1.56 

* 

.5592 

1 

4 

.01 

.01 

.01 

1.85 

1.56 

* 

.5592 

1 


Table 4. Four Sets of Model Costs 
* (see 7.3.4) 


The times to perform a barrier synchronization and a global flag test 
were taken to be the times for these operations on the current machine. 
An approximate time for one use of the sum/max circuit was given by 
Jordan. et.al.[1979J to be 

48 + loggp jjls (7.3.3) 

To this was added the time to place a value on the circuit (assumed to 

be one send) and the time to read the sum from the circuit (assumed 

to be one receive). For example, if the time for 1 multiplication/addition 
-4 

pair Is 10 seconds, the value of 5 is given by 

S = T +(0.01)(48+log 2 p) + R (7.3.4) 

For each of the machines In Table 4. Speedup(p) was determined 
for the R/B SOR. standard conjugate gradient (Global Bus), standard 
conjugate gradient (Sum/Max), and R/B 1.2-step SSOR PCG (Bus) and 
(Sum/Max) methods for the solution of Laplace's equation. These speed- 
ups are given In Table 5. 
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PCG( SSOR) 

PCG( SSOR) 

a 

£ 

R/B 

SOR 

m«0 

(Bus ) 
m-1 rn-JJ 

m-0 

(S/M) 

m-1 

m-2 


4 

2.7 

3.0 

2.4 

2.5 

3.0 

2.4 

2.5 


16 

5.7 

5.7 

4.0 

4.5 

7.2 

4.5 

4.8 

10 

64 

14.6 

4.9 

5.0 

6.7 

18.5 

10.8 

11.7 


128 

26.2 

3.0 

3.7 

5.6 

31.2 

18.8 

20.6 


384 

55.2 

1.1 

1.4 

2.5 

57.7 

36.6 

41.6 


4 

3.8 

3.9 

3.7 

3.8 

3.9 

3.7 

3.7 


16 

13.5 

13.5 

12.3 

12.7 

14.2 

12.7 

12.9 

l 

64 

46.6 

28.8 

29.0 

34.7 

50.1 

42.3 

43.9 


128 

87.5 

24.6 

28.8 

41.2 

93.3 

78.6 

82.8 


384 

211.0 

10.5 

13.9 

23.9 

219.2 

184.2 

202.2 


4 

4.0 

4.0 

4.0 

4.0 

4.0 

4.0 

4.0 


16 

15.6 

15.6 

15.5 

15.6 

15.7 

15.5 

15.6 

.1 

64 

59.6 

55.9 

56.3 

58.5 

60.5 

59.6 

60.5 


128 

114.4 

87.3 

92.9 

104.1 

116.5 

115.5 

118.6 


384 

295.2 

81.9 

102.1 

149.8 

304.5 

308.5 

329.5 


4 

4.0 

4.0 

4.0 

4.0 

4.0 

4.0 

4.0 


16 

15.8 

15.9 

15.9 

15.9 

15.9 

15.9 

15.9 

.01 

64 

61.3 

61.7 

62.1 

62.8 

61.8 

62.2 

62.9 


128 

117.9 

117.1 

119.4 

122.9 

119.5 

121.2 

123.9 


384 

307.4 

254.6 

278.6 

316.5 

316.9 

330.8 

351.6 


Table 5. speedups for Laplace's Equation 


Graphs 1A. 2A. and 3A show speedup as a function of p. the number 
of processors, for the R/B SOR. the CG(Bus). and the R/B 2-step SSOR 
PCG(Bus) algorithms respectively. Each of these graphs shows the cases 
a=.01. a=.l. 1 and 10 from top to bottom of the graph respectively. 

Graphs IB, 2B. and 3B show the speedup as a function of the ratio a 
for the same algorithms. Each of these graphs shows the cases of 

p =4.16.64, 128. and 384 and the five dotted lines Indicate perfect speedup 
for these values of p. Graphs 1A and IB show for R/B SOR that 

p =384 processors solve the problem In the shortest time Also note 

from Graph 1A that only a slight Improvement in speedup is seen for 

a=0.01 over a=0.1; whereas, a large drop in speedup can be seen by 


SPEEDUP 



GRAPH 1A. R/B SOR 
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increasing a from 1 to 10. 

However, for CG(Bus). a different story can be seen from Graph 2A 
and 2B. The number of processors that solves the problem in the 
shortest time Is not constant and in fact varies drastically with a. Note 
that the largest speedup is obtained with p=384 only for a=0.01 while 
p=128 gives the largest speedup only lor a=0.1. At the level a=l. 64 

processors gives the maximum speedup and at a=10. the speedup 

decreases for more than 16 processors. Also note that for p - 4 and 

p=16. the efficiency Is quite good for a=l; whereas lor p= 64 and 128, 
a=0.1 is strongly preferred over o=l and 384 processors should only be 
considered If a Is 0.01 or less. 

By comparing Graphs 3A and 3B with Graphs 2A and 2B. we see 
that some improvement over CG(Bus) can be achieved by using 2 steps 
of SSOR PCG(Bus) and m - 2 Is seen to be best the value of m from 
Table 5. In particular. p=384 gives the best efficiency at the at=0.1 

level for the 2-step SSOR PCG(Bus) algorithm whereas p=128 was best 
for CG(Bus). In addition, 64 processors give the best speedup at the 
a=10 level for the 2-step SSOR PCG(Bus) algorithm whereas p=16 gave 
the best speedup for this value of a for the CG(Bus) algorithm. 

The speedups for the same algorithms for the plane stress problem 


are given In Table 6. 
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PCG(SSOR) POG(SSOR) 



R/B/G 

(Bus ) 


(S/M) 


£ 

SOR 

m-0 

m-1 

m»2 

m-0 

m-1 

m»2 

4 

3.3 

3.4 

3.1 

3.1 

3.4 

3.1 

3.1 

16 

9.0 

8.9 

7.1 

7.2 

9.7 

7.4 

7.4 

64 

25.9 

13.1 

13.4 

15.4 

28.3 

20.1 

20.0 

256 

68.4 

5.6 

8.3 

12.3 

72.3 

49.6 

49.7 

4 

3.9 

3.9 

3.9 

3.9 

3.9 

3.9 

3.9 

16 

14.8 

14.8 

14.2 

14.3 

15.0 

14.3 

14.3 

64 

55.4 

45.9 

46.4 

48.5 

56.4 

52.3 

52.4 

256 

196.0 

46.3 

63.9 

85.4 

198.8 

178.2 

179.3 

4 

4.0 

4.0 

4.0 

4.0 

4.0 

4.0 

4.0 

16 

15.8 

15.8 

15.8 

15.8 

15.9 

15.8 

15.8 

64 

62.5 

61.2 

61.4 

61.9 

62.6 

62.3 

62.4 

256 

240.9 

173.2 

194.5 

211.7 

240.9 

240.5 

242.6 

4 

4.0 

4.0 

4.0 

4.0 

4.0 

4.0 

4.0 

16 

16.0 

16.0 

16.0 

16.0 

16.0 

16.0 

16.0 

64 

63.3 

63.3 

63.5 

63.5 

63.3 

63.5 

63.6 

256 

246.6 

238.5 

244.5 

248.5 

246.1 

249.3 

251.4 


Table 6. Speedups for the Plane Stress Problem 


Graphs 4A. 5A. 6A, and 7A show speedup as a function of p, the 
number of processors, for the R/B/G SOR. the CG(Bus). the R/B/G 2- 
step SSOR PCG(Bus). and the R/B/G 2-step SSOR PCG(Sum/Max) algo- 
rithms respectively. Each of these graphs shows the cases a=.01. a=. i. 
a=l and a=10 from top to bottom of the graph respectively. Graphs 4B. 
5B. 6B. and 7B show speedup as a function of a for the same algo- 
rithms where the four dotted horizontal lines respresent perfect speedup 
for p=4. 16. 64. and 256. From Graphs 4A and 4B the R/B/G SOR 
algorithm Is seen to be very efficient on machines with a as large as 1 
for p< 64. Even though the speedup drops drastically for a=10. p = 256 
processors still solve the problem faster. From Graphs 5A and 5B and 
6A and 6B. we see a situation similar to Graphs 2A and 2B and 3A 
and 3B for Laplace's equation; namely. If CG(Bus) is used. p= 256 Is 
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preferred at the a=.l level, p - 64 at the a=l level (perhaps a=10 also). 
From Graph 6A and 6B. we see that with the use of 2-step SSOR 
PCG(Bus) algorithm, p = 256 gives the largest speedup up to level a=l. 
but for a=T0. p= 64 is the maximum number of processors to use. 

Graph 7A and 7B show that the sum/max hardware circuit greatly 
improves the efficiency of the 2-step SSOR PCG algorithm. In particular. 
p= 256 solves the problem In the fastest time for all values of a. 

It should be noted that speedup is not a viable measurement to 
compare parallel algorithms since the algorithm that can be implemented 
with the least parallel overhead (most efficient) may still take longer to 
execute due to extra arithmetic calculations. This was the case with 
R/B/G SOR applied to the plane stress problem. The algorithm is very 
efficient for all values of a but takes too long to converge to a solution 
to be competitive with any of the conjugate gradient type algorithms. 
Execution time comparisons are given in section 7.3.3. 


7.3.2. Para-efflclency 

Schwartz[1979] recommended the para-efficiency 


PEFF (p ) = 


E(p ,H) 


£(p ,p -array) 

as a good measure of an algorithm's performance on a particular 

hardware configuration H. The para-efflclency is the ratio of the execu- 
tion time of an algorithm on a particular hardware. H . to the execution 

time on the p-array. The p-array is an unrealizable hardware configura- 

tion In which all p processors are connected to each other and a cen- 
tral shared memory. The time to write into and read from this memory 
is assumed to be neglible compared with arithmetic time. Also no 
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memory contention is assumed for the p-array. This p-array Is 

envisioned as an ideal architecture for p processors since the overhead 
due to passing data between them is as small as possible. 

The FEM can be envisioned as a p-array If the send time is equal 
to the time to write to memory and the receive time Is equal to the 

time to read from memory. Note that FEM never has the problem of 
memory contention. Let us define our p-array. or benchmark computer, 
to be a FEM for which send and receive operations over the local links 
and the global bus takes 0.1/zs each and the barrier and flag tests 

require the same time as given In Table 4. This definition views our 

benchmark computer to be nearly Ideal In the sense of sending and 

receiving data but communication overhead can still occur in the form of 
barrier or flag checking operations. This is consistent with the assump- 
tion in Table 4 that the values of 8, F. S, and 0 are constant and 

that the major overhead is due to the send and receive operations The 

PEFFCp ) values for the machine configurations of the last section are 

given in Table 7 for Laplace's equation. 
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PCG ( SSOR ) 

PCG 

( SSOR ) 



r/b 

( BUS ) 

(S/M) 

a 

£ 

SOR 

m -0 

m -2 

m -0 

m -2 


4 

1.5 

1.3 

1.6 

1.3 

1.6 


16 

2.8 

2.8 

3.5 

2.2 

3.3 

10 

64 

4.2 

12.7 

9.4 

3.3 

5.4 


128 

4.5 

40.3 

22.3 

3.8 

6.0 


384 

5.6 

298.0 

142.5 

5.5 

8.5 


4 

1.0 

1.0 

1.1 

1.0 

1.0 


16 

1.2 

1.2 

1.3 

1.1 

1.2 

l 

64 

1.3 

2.2 

1.8 

1.2 

1.5 


128 

1.4 

4.9 

3.0 

1.3 

1.5 


384 

1.5 

30.7 

14.9 

1.5 

1.8 


4 

1.0 

1.0 

1.0 

1.0 

1.0 


16 

1.0 

1.0 

1.0 

1.0 

1.0 

.1 

64 

1.0 

1.1 

1.1 

1.0 

1.1 


128 

1.0 

1.4 

2.1 

1.0 

1.1 


384 

1.1 

3.9 

2.4 

1.0 

1.1 


4 

1.0 

1.0 

1.0 

1.0 

1.0 


16 

1.0 

1.0 

1.0 

1.0 

1.0 

.01 

64 

1.0 

1.0 

1.0 

1.0 

1.0 


128 

1.0 

1.0 

1.0 

1.0 

1.0 


384 

1.0 

1.3 

1.1 

1.0 

1.0 


Table 7 . Para-efficiencies for Laplace's Equation 


Several conclusions follow from Table 7. 

(1) a=.01 Is virtually as good as the p-array for ail the algorithms. 

(2) For a=0.1. an array like the FEM supports very efficiently all 
algorithms that do not use the bus. In addition, the R/B SSOR 
PCG(Bus) algorithm Is very efficient for p<64. 

(3) For a=l. the bus should only be considered when p< 16. 

(4) For a=10, only p= 4 yields a good para-efflclency. 

The values of PEFF(p) are given in Table 8 for the plane stress 


problem. 
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PCG(SSOR) 

PCG 

( SSOR ) 



(Bus) 

(S/M) 

a 

£ 

m-o 

m«2 

m-0 

m-2 


4 

1.2 

1.3 

1.2 

1.3 

10 

16 

1.8 

2.2 

1.6 

2.2 


64 

4.8 

4.1 

2.2 

3.2 


256 

44.6 

20.5 

3.4 

5.0 

l 

4 

1.0 

1.0 

1.0 

1.0 


16 

1.1 

1.1 

1.1 

1.1 


64 

1.4 

1.3 

1.1 

1.2 


256 

5.4 

3.0 

1.2 

1.4 


4 

1.0 

1.0 

1.0 

1.0 

.1 

16 

1.0 

1.0 

1.0 

1.0 


64 

1.0 

1.0 

1.0 

1.0 


256 

1.4 

1.2 

1.0 

1.0 


4 

1.0 

1.0 

1.0 

1.0 

.01 

16 

1.0 

1.0 

1.0 

1.0 


64 

1.0 

1.0 

1.0 

1.0 


256 

1.0 

1.0 

1.0 

1.0 


Table 8. Para-efficiencies for Plane Stress Problem 

Several conclusions follow from Table 8. 

(1) a=.01 Is virtually as good as the p-array for all the algorithms. 

(2) a=0.1 is virtually as good as the p-array for all algorithms that 
do not use the bus. 

(3) For a=l and at=10, the para-efficiency is more dependent on 
p:hence. the sum/max circuitry Is becoming more Important for 
larger p. 

(4) m =0 Is more efficient for a=l and a=10 even though m =2 is 
seen from the next section to yield a smaller execution time. 

7.3.3. Execution Time Results 

If we consider factors such as simplicity and maintainability to be 
equal for all our algorithms, the best parallel algorithm to solve a given 
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problem will naturally be the one that requires the least time to execute. 
The execution times for the parallel algorithms are given in Table 9 for 
Laplace's equation. 

PCG(SSOR) PCG(SSOR) 

R/B (Bus) (£/M) 


a 

£ 

SOR 

ra— 0 

m-1 

»-2 

m-0 

m-1 

m-2 


l 

30.1 

47.3 

34.6 

35.3 

47.3 

34.6 

35.3 


4 

11.1 

15.8 

14.6 

14.3 

15.6 

14.4 

14.2 

10 

64 

2.1 

9.6 

7.0 

5.1 

2.6 

3.2 

3.0 


128 

1.2 

15.7 

9.5 

6.0 

1.5 

1.8 

1.7 


384 

.55 

43.7 

23.9 

13.9 

.82 

.94 

.85 


4 

7.9 

12.2 

9.2 

9.4 

12.2 

9.2 

9.4 


16 

2.2 

3.5 

2.8 

2.8 

3.3 

2.7 

2.7 

1 

64 

.65 

1.6 

1.2 

1.0 

.94 

.82 

.80 


128 

.34 

1.9 

1.2 

.86 

.51 

.44 

.43 


384 

.14 

4.5 

2.5 

1.5 

.22 

.19 

.17 


4 

7.6 

11.9 

8.7 

8.9 

11.9 

8.7 

8.9 


16 

1.9 

3.0 

2.2 

2.3 

3.0 

2.2 

2.3 

.1 

64 

.51 

.85 

.61 

.60 

.78 

.58 

.58 


128 

.26 

.54 

.37 

.34 

.41 

.30 

.30 


384 

.10 

.58 

.34 

.24 

.16 

.11 

.11 


4 

7.5 

11.9 

8.7 

8.8 

11.9 

8.7 

8.8 


16 

1.9 

3.0 

2.2 

2.2 

3.0 

2.2 

2.2 

.01 

64 

.49 

.77 

.56 

.56 

.77 

.56 

.56 


128 

.26 

/. 40 

.29 

.29 

.40 

.29 

.28 


384 

.10 

.19 

.12 

.11 

in 

H 

.10 

.10 


Table 9. Execution Times (sec) for Laplace's Equation 
( 1 mult/add « 0 . 0001 sec . ) 


Graphs 8A and 8B show the execution time (seconds) versus p. the 
number of processors, for the R/B SOR and the m=0.1.2 step 
PCG(SSOR)(S/M) algorithms for a=10 and a=l respectively (for a=l. m= 2 
is not shown). R/B SOR Is seen to be the fastest method for Laplace's 
equation for both a levels. It is Interesting to note from Graph 8A that 
there exists a value of p between 4 and 16 for which taking more than 
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0 steps of m-step PCQ(SSOR) is not cost effective. The reason for this 
Is that the time to do the extra communication required for the precon- 
ditioner Is more than the time gained by the fewer number of iterations. 
Note that when the cost of communication is reduced as in Graph 8B. 
the ranking of algorithms for all p is the same as the ranking for the 
case p s 1. Furthermore, this ranking will continue as a decreases. 

Graph 9A shows the execution time (seconds) versus a for the R/B 
SOR algorithm for the cases p=l(dotted). p= 4. p=16. p=64. p=128 and 
p= 384. Graph 9B shows execution time versus p for the cases cr=.l. 
a=l and a=10 for the R/B SOR algorithm. Note that p=384 gives the 
least exeuction time for all a levels. Both graphs show a slight 
increase in execution time from increasing a from .1 to 1 for p>64 and 
a larger increase from increasing a from 1 to 10 for all values of p. 
Graph 9A also can be used to help answer the question of tradeoff 
between faster communication and more processors. For example, for a 
machine with a=10 and p=64. a greater reduction in execution time can 
be realized by going to a a=l.p=64 machine (faster communication) 
rather than adding more processors to get an a=10,p=128 or possibly 
p=384 machine. 

An examination of Graph 9B shows for a=10. the graph changes 
convexity between p= 4 and p= 64. In other words, the execution time 
decreases more from p=16 to p= 64 than from p= 4 to p=16. The rea- 
son for this Is that when p* 16 there are completely Interior processors 
and the number of local receives double (7.11) and the number of sends 
virtually double (7.10). When a=10 this factor is amplified but for a=1 
or a=. 1, the communication cost is much less and this factor is not 


noti cable. 
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Graph 9B also Illustrates that the execution time reduces by much 
more when decreasing a from 10 to 1 than the reduction seen when a 
Is reduced from 1 to 0.1. This suggests that there will be an optimal 
value of a below which the gains in execution time reduction will not 
justify the cost of making the machine's communication faster. In fact. 
a=.01 gives almost identical execution time results as a=.1 for R/B SOR 
and was not included in Graph 9B. 

The execution times for the parallel algorithms for the plane stress 
problem are given In Tables 10A and 10B for the global bus and 
sum/max circuit respectively. 


PCG( SSOR) 


< 


a 

£ 

m=0 

m*l 


l 

1115.0 

705.0 


4 

328.0 

230.0 

10 

16 

126.0 

99.0 


64 

85.0 

52.0 


256 

201.0 

85.0 


4 

284.0 

182.0 

l 

16 

75.0 

50.0 


64 

24.3 

15.2 


256 

24.1 

11.0 


4 

279.0 

177.0 

.1 

16 

70.0 

45.0 


64 

18.1 

11.5 


256 

6.4 

3.6 


4 

279.0 

176.0 

.01 

16 

70.0 

44.0 


64 

17.6 

11.1 


256 

4.7 

2.8 


Bus) 


m«2 m=3 m-4 


597.0 

862.0 

798.0 

195.0 

286.0 

267.0 

83.0 

123.0 

116.0 

39.0 

55.0 

50.0 

49.0 

58.0 

47.0 

154.0 

222.0 

206.0 

42.0 

61.0 

56.0 

12.3 

17.6 

16.2 

7.0 

8.9 

7.5 

150.0 

216.0 

200.0 

38.0 

55.0 

51.0 

9.7 

13.9 

12.9 

2.8 

3.9 

3.3 

149.0 

215.0 

200.0 

37.0 

54.0 

50.0 

9.4 

13.5 

12.5 

2.4 

3.4 

3.2 


Table 10A. 


Execution Time ( seconds ) 
Plane Stress Problem (Bus) 
(1 mult/add » 0.0001 sec) 
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PCG( SSOR ) 
( Sum/Max) 


a 

2 

m-0 

m-1 

m-2 

m-3 

m—4 


l 

1115.0 

705.0 

597.0 

862.0 

798.0 


4 

326.0 

229.0 

195.0 

285.0 

267.0 

10 

16 

115.0 

95.0 

81.0 

121.0 

114.0 


64 

39.0 

35.0 

30.0 

45.0 

43.0 


256 

15.0 

14.0 

12.0 

18.0 

17.0 


4 

284.0 

182.0 

154.0 

222.0 

206.0 

l 

16 

74.0 

49.0 

42.0 

61.0 

56.0 


64 

19.8 

13.5 

11.4 

16.6 

15.5 


256 

5.6 

4.0 

3.3 

4.9 

4.5 


4 

279.0 

177.0 

150.0 

216.0 

200.0 

.1 

16 

70.0 

45.0 

38.0 

55.0 

51.0 


64 

17.8 

11.3 

9.6 

13.8 

12.8 


256 

4.6 

2.9 

2.5 

3.5 

3.3 


4 

279.0 

176.0 

149.0 

216.0 

200.0 

.01 

16 

70.0 

44.0 

37.0 

54.0 

50.0 


64 

17.6 

11.1 

9.4 

13.5 

12.5 


256 

4.5 

2.8 

2.4 

3.4 

3.2 


Table 10B. 

Execution Time ( seconds ) 

Plane Stress Problem (Sum/Max) 
( 1 mult/add = 0 . 0001 3ec ) 


Graphs 10A and 10B show the execution time (seconds) versus p 
for the 0.1,2-step PCG(SSOR)(Bus) algorithms for a=10 and a=l respec- 
tively. For both graphs, m= 2 solves the problem in the least time for 

all values of p. Note from Graph 10A that the execution time increases 
if 256 processors are used. This is because communication is expensive 
for a=10 and the cost of the bus for this many processors Is prohibitive. 
Graph 10B shows that If communication is less expensive (a=l), that 
p=256 processors still give a further reduction in execution time for m=2 
and m= 1 but not for m= 0. This clearly shows for a large number of 
processors that preconditioning is necessary to reduce the number of 
iterations In order to decrease the global communications even though 
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more local communications result. 

The execution time for the same algorithms and the sum/max circuit 
for the Inner products is plotted In Graphs 11A and 11 B for a=10 and 
a=1 respectively. By comparing these graphs with Graphs 10A and 10B, 
the need for special hardware to do the inner products for the conjugate 
gradient methods Is apparent whenever p> 64. 

The time gained In the reduction in the number of Iterations with 
m >0 Is greater than the extra time required for the preconditioner com- 
munications; hence. m= 1,2 are faster than m= 0 for both graphs. Note 

from Graph 11 B that more of an improvement Is seen with a=1 because 
of reduced communication. The convexity changes around p=16 in 

Graph 11 B reflect the Increase in communication cost of completely inte- 
rior processors as was discussed for Graph 9B. Again, when a=1, this 
increase is not observed in Graph 11B. 

Graph 12A shows the execution time (seconds) for R/B/G 2-step 

SSOR PCG(S/M) as a function of a with special cases shown for 
p=l(dotted). p =4. p=16, p =64, and p= 256. For all a levels, p =256 
yields the least execution time. However, note that it may be more cost 
effective to use 64 processors and make communication faster (a=1) 
rather than use a=10 and and Increase p to 256. 

Finally. Graph 12B shows execution time (seconds) versus p for the 
R/B/G SSOR PCG(Sum/Max) algorithm for the special cases a=10. a=l. 

and a=0.1. This graph closely resembles Graph 9B for R/B SSOR 
PCG(Sum/Max) for Laplace's equation In particular, for p<64 there Is 

very little difference between the execution time for a=.l and a=1, but 
there Is a significant execution time difference between a=.1 and a=10. 



EXECUTION TIME (SEC) 



EXECUTION TIME (SEC) 
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This suggests that there an optimal value of a. below which it is not 
cost effective to produce faster hardware. 

7.3.4. Reliability Considerations 

Speedup (p) is an Indication of how much faster a problem can be 
solved on p processors than on a single processor. However, this 
measurement does not account for the fact that as more processors are 
added, the machine may be less reliable because of an increased pro- 
bability of component failures. 

We now define the reliability. R(p.t ). of the array computer to be 

the probability that all p processors will run at least t units of time 

without failure. To derive an expression for Rip.t ) the following assump- 
tions must be made: 

(1) The machine fails if any one of Its processors fail. 

(2) The failure of a processor Is independent of the failures of the 

other processors. 

(3) The failure rate. 0. for each processor is constant. 

(4) The reliability is a decreasing function with time. 

We are interested only In the failures due to chance and not 

failures due to component ’burn-ln* or old age. as shown below (see 
Mlller(1977]). 



With these assumptions, the reliability of one processor is given by 
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fld^) 



(7.4.1) 


and the mean time until failure for one processor is 



(7.4.2) 


Now. assumption (2) Implies that the multiplicative law of probability holds 
for the p -processor case and that 

-Bpt 

R(p.tp) - e P (7.4.3) 

where f denotes the time to solve the problem on a p-processor 
machine. 


The mean time until failure for the p-processor machine Is given by 



If Eff(p) represents the efficiency with p processors. 


. *1 ' j _ 

pt p ~ speedup Ip) ~ Etf (p) 

and by using (7.4.5). equation (7.4.3) can be written as 


fl(p.fp) = e 


-*1 

eff(p) 


(7.4.5) 


(7 4.6) 


Note from (7.4.6) that the reliability Is an Increasing function of the effi- 
ciency. Hence. If we expect the efficiency to decrease with the number 
of processors, so will the reliability. However. If the machine parameter 
a and the algorithm were such that the efficiency remained nearly con- 
stant as the number of processors Increased, the reliability would also 
be nearly constant for all p. 
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To Illustrate these concepts. Table 11 shows how the reliability changes 
as the length of the solution time for the problem on a single processor 
Increases. The speedups for the R/B/G SSOR PCG(Sum/Max) algorithm 
for the plane stress problem were used to calculate the efficiencies in 
(7.4.6). The failure rate of a single processor, £, was taken to be 
0.001 (0.001 failures every 1 unit of time), where a unit of time can be 
specified to be any amount of time that precisely defines the failure rate. 


Uniprocessor 
Job Length 

R-i 

£-4 

£=16 

p=64 

£=256 

(time units) 






1 

99.90 

99.87 

99.78 

99.68 

99.49 

10 

99.00 

98.72 

97.86 

96.85 

94.98 

100 

90.48 

87.89 

80.56 

72.61 

59.74 



a=10. 

£=0.001 



1 

99.90 

99.90 

99.89 

99.88 

99.86 

10 

99.00 

98.98 

98.89 

98.79 

98.58 

100 

90.48 

90.25 

89.41 

88.50 

86.69 



a=l , 

£=0.001 



1 

99.90 

99.90 

99.90 

99.90 

99.90 

10 

99.00 

99.00 

98.99 

98.98 

98.95 

100 

90.48 

90.48 

90.37 

90.25 

89.99 



a=.l. 

£=0.001 




Table 11. Reliabilities 


Graph 13 shows the results from Table 11 for a=l. The s graph indicates 
that the efficiency drops as the number of processors increases as was 
expected. 



RELIABILITY (%) 
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This factor should be kept in mind when analyzing the execution 
time graphs of section 7.3.3. Even though these graphs may show a 
decrease in execution time as more processors are added, the decrease 
may not be enough to balance the decrease in reliability. The level of 
reliability that is required is certainly a design issue for any machine. 

7.3.5. Conventional Machines/Soivers 

A natural question to ask Is how well the execution time of an 
algorithm on an array computer such as the Finite Element Machine 
compares to the execution time of a conventional algorithm on a con- 
ventional machine. In other words. It may be of interest to determine 
the number of processors, p. and the communication to arithmetic ratio. 
a. that are needed for the algorithm on the array to be competitive with 
a benchmark algorithm Implemented on a benchmark machine. 

To answer this question we make several assumptions. First, we 
assume that a direct method such as banded Cholesky decomposition 
followed by forward and backward substitution will be the benchmark 
method. Secondly, assume that all the bands of the stiffness matrix K 
will fit into core storage of the benchmark computer We note that this is 
the best possible sltutatlon for the benchmark algorlthm/machlne since 
for large problems time must be spent In bringing the matrix to and 
from core and backing store. Lastly, we assume three times, J3, for 
performing one multiplication/addition pair on the benchmark computer; 
namely. 10 5 .10 6 , and 10 7 pa. 

The number of multiplication/addition pairs for the banded Chloesky 
algorithm Is easily calculated. George and Liun981], to be 
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j (b)(b+3)/V - (^H-b 2 +|b) 

where b Is the semi-bandwidth. The number of multlplicatlon/addltlon 
pairs for both the forward and backward subitltution Is 

„2 h 

Cb+UA/ - (^tf) 

where for the plane stress problem with an interior grid of 16 by 48 the 
bandwidth and number of equations are given by 

b = 96 
N = 1536 

Hence, the number of muitiplicatlon/addition pairs, a. Is easily found to 
be 

a = 7891936 

-5 -6 -7 

For the three arithmetic speeds. £=10 .10 , and 10 .the time in 
seconds for this algorithm is a linear function of £ and equals 78.92. 
7.89, and .79 for the three £ values respectively. 

Graph 14 shows the execution time as a function of the number of 
processors for a=0.1, 1. and 10 for the R/B/G SSOR 2-step 

PCG(Sum/Max) algorithm where the dotted horizontal lines represent the 
three execution times of the conventional solver on the conventional 
machine. The numbers of processors required to yield a smaller execu- 
tion time than the Choiesky algorithm for the three values of £ are 


summarized in Table 12. 
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£ 


.1 

> 

256 

1.0 

> 

256 

10.0 

> 

256 


,8=1 0 -7 


.1 

> 

64 

1.0 

> 

64 

10.0 

> 

256 


,8=1 0" 6 


.1 

> 

4 

1.0 

> 

4 

10.0 

> 

16 


,8=1 (f 5 



Table 12. Comparison to a Conventional Solver/Machine 

We acknowledge that the comparison to a conventional solver Is dif- 
ficult to make because the story may change completely as the problem 
size grows. For instance, direct methods will require more storage and 
the number of operations may no longer be competitive with Iterative 
methods, especially if good Initial guesses are known. For an array 
computer like FEM. the storage Is distributed across the processors and 
iterative methods do not require storage of any nonzero elements of 
the stiffness matrix K : hence, extra time will not be as likely to be 
needed to move the data to and from backing store to core as would 
be true with direct methods. The point to be made here is that this 
type of analysis Is simple once the benchmark aigorlthm/machlne are 
determined and realistic times for this algorithm/machine are obtained. 



CHAPTER 8 


Conclusions and Future Directions 


8.1. Conclusions 

Two algorithms were developed In Chapter 4 for assembling the 
system of linear equations by the finite element method on array com- 
puters. The first algorithm required no communication between proces- 
sors but resulted in a duplication of effort among the processors. The 
second algorithm required no duplication of effort at the expense of 
communication between processors. Analytic formulas were obtained for 
the speedup, efficiency, and overhead of these algorithms on a p- 
processor array. The more efficient algorithm was shown for p >4 pro- 
cessors to be a function of the ratio of the time to send and receive 
one value to the time to calculate one coefficient of the stiffness matrix 
For p=4 processors, the choice of algorithms also depends on the size 
of the grid of unknowns that is assigned to each processor. We also 
described In Chapter 4 how to calculate the stress vector in parallel 
without communication between processors or duplication of effort. 

in Chapter 5 we developed a new stationary Iterative method, called 
Multi-color SOR, for solving the large sparse linear systems arising from 
both finite element and finite difference discretizations. Tnis method Is a 
generalization of the classical Red/Black ordering and allows the succes- 
sive overrelaxatlon (SOR) method to be Implemented on both vector com- 
puters and parallel arrays as a multiple sweep Jacobi method which has 
Ideal properties for these machines. 
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The stiffness matrix K that results from a Multi-color ordering of the 
problem grid, was shown In general not be be consistently ordered, p- 
cyclic. or generally consistently ordered; therefore, the development of a 
theory for this class Of matrix that will lead to the determination of the 
optimal relaxation factor u Is yet to be found. Numerical results show 
that the SOR method with the Multi-color ordering and the natural order- 
ing of the grid converges In approximately the same number of itera- 
tions; therefore, the coloring of the grid for our test problems was not 
detrimental to the convergence rate of the method. 

An efficient implementation of a Multi-color SSOR method that Is 
based on a forward followed by a backward Multi-color SOR step was 
also given in Chapter 5. Numerical results for this method for a plane 
stress problem show that the optimal u> Is close if not equal to 1. It is 
well known that the optimal w for the Red/Black ordering of a matrix 
with Property A is 1. but it has yet to be proved whether or not this is 
true for general Multi-colored matrices. 

Lastly, in Chapter 5, the Multi-color SOR method was generalized to 
the Block Multi-color SOR method, if the grid points In each block are 
from k consecutive rows (or columns) of the problem grid so that the 
matrix will be 7r-consistently ordered, (see YoungI1971]), a theory exists 
for determining the optimal relaxation factor. On the other hand. If the 
grid points are blocked by Ixk blocks of convenient size for implementa- 
tion on a array of processors, It is generally not the case that the 
matrix will be ^-consistently ordered. 

In Chapter 6, we developed and analyzed an m-step preconditioned 
conjugate gradient method that can be efficiently Implemented on both 
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vector computers and parallel arrays. This method takes m steps of a 
linear stationary Iterative method derived from a symmetric and nonsingu- 
lar splitting of the stiffness matrix K In order to precondition the system. 
In Theorem 1. we extend a result of Dubois, Qreenbaum. and 
Rodriquetl979] by giving the necessary and sufficient conditions for the 
resulting preconditioning matrix, M. to be symmetric and positive definite. 
In Theorem 2. we relate the positive definiteness of M to the conver- 
gence of the linear stationary iterative method and thereby generalize the 
Jacobi Convergence Theorem. 

In Theorem 3. the condition number of the preconditioned system. 
*(K m >. was proven to be a decreasing function of m If all the eigen- 
values of G. the Iteration matrix for the linear stationary Iterative method, 
are positive. However. If the smallest eigenvalue of G Is negative, the 
condition number behaves differently for m odd and m even. In partic- 
ular. Theorem 4 shows if X n > | ^ -j ( - the condition number is decreasing 
for m odd. but Is decreasing for m even if and only If the following 
Inequality holds 

<1+ | x.,| ' n+ 1 )n-x™><u-o' n xi-x™ +1 > 

where 0=max | X^ J . This means that Increasing m from an odd to a 

consecutive even number of steps is more beneficial, in some cases, 
than increasing m from an even to a consecutive odd number of steps. 
These results further explain observations of Dubois. Greenbaum, and 
RodrfqueC19791. 

The most promising linear stationary iterative method that we used 
for the m-step preconditioner was the Multi-color SSOR method. Numer- 
ical results show that the ratio of the number of Iterations with the 2- 
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step Multi-color SSOR PCQ method to the 1-step Multi-color SSOR PCG 
method was 1.40(1.36) for the plane stress problem and Laplace's equa- 
tion respectively (the theoretical maximum Is 2.0). To improve these 

results, we developed an m-step extrapolated PCG method that can be 
effective whenever all the eigenvalues of G are nonnegative (as is true 
for SSOR). Numerical results with this method show the ratios 1.40(1.36) 
are reduced to 1.93(1.76) respectively with little additional work. The 
disadvantage of using this method is that little theory exists for determin- 
ing the extrapolation factor y. 

Finally, In Chapter 6, we compared our m-step extrapolated PCG 
method to the Parametrized Preconditioned Conjugate Gradient Method 
(PPCG) of Johnson. Mlcchelli. and Paul[19823 and showed the two are 
equivalent whenever m= 2. For m> 2. the PPCG method appears more 

general since the freedom of choosing more than one parameter can 
possibly lead to a better preconditioner. By using our more general 

symmetric and nonsingular splitting of the matrix K. we showed how to 
generalize the PPCG method. More research Is required to determine 
the effectiveness of this approach. 

In Chapter 7. we developed a model for comparing the execution 
time of parallel algorithms on an array computer. This model included 
the time for arithmetic, local convergence testing, synchronization for 
decision making, sending and receiving values ever a global bus, and 
performing a summation of p numbers via the global bus or alternatively 
by a special hardware circuit. The hardware times for doing one of 

each of the above operations was varied to determine the dependence of 
an algorithm's performance on these parameters. The model was vali- 
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dated on a 4-processor Finite Element Machine at NASA Langley 
Research Center; however, as more processors are added to this 
machine, a more detailed validation can be done. 

The model was used to predict speedups as well as execution times 
for our algorithms for Laplace's equation and the plane stress problem 
on a p-processor machine where a respresents the ratio of communica- 
tion to arithmetic time. The major results are Itemized below for 
Laplace's equation. 

(1) R/B SOR was the most efficient and fastest algorithm for all 
values of p. The speedup for 384 processors was 307 for 
a=.01. 295 with a=.l. 211 with a=l and as low as 55 with 
a=10. 

(2) The conjugate gradient method with the global bus for the Inner 
products should not be used with a large number of processors 
unless a<.01. 

(3) Some Improvement over the CG(Bus) algorithm is obtained by 
taking two steps of the Red/Black SSOR preconditioner; however, 
more Improvement is gotten by using the sum/max hardware 
circuit for the Inner products with this 2-step method but not 
enough to be competitive with the Red/Black SOR algorithm. 

The major results are Itemized below for the plane stress problem. 

(1) The Red/Black/Qreen SOR algorithm, even though very efficient 
on an array, takes too many iterations to converge to be com- 
petitive with the conjugate gradient methods. 

(2) The Red/Black/Qreen 2-step extrapolated SSOR preconditioned 
conjugate gradient algorithm with the sum/max circuit for the 
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inner products was the fastest method for this problem and 
quite efficient as well. The speedup values for 256 processors 
are 251 for a=.01. 242 for a=.l. 179 for a=l, and only 50 for 
a=10. 

(3) The execution time for the method In (2) above varied very little 
when a was increased from .1 to 1 but varied significantly for 
p>16 when a was Increased from 1 to 10. Therefore, if a 
parameter study were done to determine the value of a for 
design purposes, extra model runs should be done between the 
range a=l and a=10. 

In Chapter 7. we showed that the reliability of a p-processor array 
decreases as the value of a increases. For example with p= 256. a job 
length of 100 time units, and a component failure rate of 1 every time 
unit, the reliability decreases from 90% for a=. 1 to 87% for a=l to 60% 
for a=10. Finally. In Chapter 7. we outlined the procedure for compar- 
ing our algorithms with Cholesky decomposition followed by forward and 
backward substltfon on a conventional computer. 

8.2. Future Directions 

The efficient implementation of the Multi-color SOR and the m-step 
SSOR preconditioned conjugate gradient methods on an array of proces- 
sors depends on the coloring of the nodes, of the discretization followed 
by a particular mapping, or assignment, of the problem nodes to the 
processors. We gave in Chapter 5 the solution to this assignment prob- 
lem for the special case of a rectangular problem domain. However, for 
Irregular regions, the coloring of the nodes corresponds to a graph 
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coloring problem and In general Is an NP-complete problem (see McDi- 
armld[19791. and More '[19813 for examples). Furthermore. Bokhari[1979] 
showed that assigning p nodes to p processors in order to reduce the 
communication time Is also an NP-complete problem for a general prob- 
lem domain. However. Bokhari did not consider the assignment of multi- 
ple nodes per processor. We note that the assignment of nodes to the 
processors is not independent of the solution algorithm used to solve the 
system of linear equations and for our algorithms must be viewed in 
conjunction with the coloring problem. 

A second area for further research Is the comparison of block and 
point Iterative methods for parallel processors. Because of the overhead 
due to communication that was seen for our point methods, block 
methods may be competitive on these machines since the processor will 
become more computationally bound. These methods may prove effective 
for structural engineering problems since they are closely related to the 
modular or substructuring approach that Is commonly used by structural 
engineers. 

A third Important area for the extension of our Ideas Is in the 
development of new Iterative algorithms such as asynchronous methods 
and multi-grid techniques. The asynchronous methods of Baudet(l978] 
were Implemented on a multi-processor system with central shared 
memory. For the distributed memory multiple Instruction multiple data 
Finite Element Machine, research is needed to determine if these 
methods are competitive with synchronous methods. For example, the 
Multi-color SOR method can be Implemented In an asynchronous fashion, 
thereby eliminating the wait time due to the synchronous receive and the 
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overhead due to some global flag checking, but it has yet to be deter- 
mined whether the time saved by this overhead outweighs the possible 
Increase In time If more Iterations are necessary for convergence. We 
note that the current conjugate gradient methods can not run asynchro- 
nously since synchronization Is necessary to accumulate the partial sums 
for the Inner products. 

An efficient multi-grid method for an architecture like the Finite Ele- 
ment Machine Is another area for future research. This method requires 
relaxation on different sized problem grids and therefore may require a 
different communication strategy than the eight nearest neighbor connec- 
tions. In addition, a parallel iterative method for performing the smooth- 
ing relaxations must also be developed. 
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